## Working with text data (tokenizing text)

The text we will tokenize for LLM training is “The Verdict”. We use it as a running example to show how tokenization works.

In [3]:
# load 'the-verdict.txt'
file_path = 'the-verdict.txt'

with open(file_path, 'r', encoding = 'utf-8') as f:
    raw_text = f.read()

print('total number of chars:', len(raw_text))
print(raw_text[:99])

total number of chars: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


### Dealing with files: `open()`

`open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)`

- `file`: file path
- `mode`: 
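    - `r`: read-only (default); the file must exist.
    - `w`: write mode; overwrites existing content and creates the file if it does not exist.
    - `a`: append mode; writes at the end of the file and creates the file if it does not exist.
    - `x`: exclusive creation; raises an error if the file already exists.
    - `b`: binary mode (e.g. `rb`, `wb`).
    - `+`: update mode; allows both reading and writing (e.g. `r+`, `w+`).
- `encoding`: text encoding, e.g. `utf-8`, `gbk`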

---

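In Python, text mode (e.g. `'r'` or `'w'`) and binary mode (e.g. `'rb'` or `'wb'`) are the two basic ways of working with files. They differ in **data type, encoding handling, and newline conversion**. A detailed comparison follows:



### Summary of the key differences
| **Aspect**           | **Text mode** (`'r'`/`'w'`)          | **Binary mode** (`'rb'`/`'wb'`)        |
|-------------------|-------------------------------------|---------------------------------------|
| **Data type**        | Reads and writes `str` (decoded text) | Reads and writes `bytes` (raw byte stream) |
| **Encoding**         | Decodes/encodes using the `encoding` argument (e.g. `utf-8`) | **No encoding step; bytes are handled as-is** |
| **Newline handling** | With the default `newline=None`, `\r\n` is converted to `\n` on read, and `\n` is written as the platform's line separator | No conversion; the file's original bytes are preserved |
| **Typical use**      | Text files (e.g. `.txt`, `.csv`)      | Non-text files (images, video, other binary data) |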

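The difference between the two modes can be checked directly. The following is a minimal sketch, assuming a throwaway file in a temporary directory (the name `demo.txt` is made up for illustration): the same bytes are read once in text mode and once in binary mode.

```python
import os
import tempfile

# Raw bytes with a Windows-style line ending and a non-ASCII character.
payload = "café\r\nline two\n".encode("utf-8")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(payload)

    # Text mode: bytes are decoded to str and "\r\n" is normalized to "\n".
    with open(path, "r", encoding="utf-8") as f:
        as_text = f.read()

    # Binary mode: the bytes come back exactly as they were stored.
    with open(path, "rb") as f:
        as_bytes = f.read()

print(type(as_text).__name__, repr(as_text))
print(type(as_bytes).__name__, repr(as_bytes))
```

Writing the file in binary mode first keeps the `\r\n` bytes intact on every platform, so the normalization in text mode is visible when reading.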


In [16]:
# dealing with text with regular expression
import re
text = 'Hello, word. This, is a text.'
result = re.split(r'(\s)', text) # \s matches any whitespace character; split on whitespace
print(result)

['Hello,', ' ', 'word.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'text.']


In [17]:
# split on whitespace (\s), commas, and periods
result = re.split(r'([,.]|\s)', text)
print(result)

['Hello', ',', '', ' ', 'word', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'text', '.', '']


A small remaining problem is that the list still includes whitespace characters and empty strings. Optionally, we can remove these redundant items as follows:

In [18]:
# remove white space from the list
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'word', '.', 'This', ',', 'is', 'a', 'text', '.']


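Note that `''.strip()` returns the empty string `''`, not `None`. An empty string is falsy, so `if item.strip()` behaves like `if None` for empty or whitespace-only items: the condition fails and the item is left out of the list.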

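The filtering rule above can be seen item by item in a minimal sketch. The `candidates` list is made up for illustration; it mixes real tokens with the empty and whitespace-only strings that `re.split` tends to produce.

```python
# Made-up sample items: real tokens plus empty / whitespace-only strings.
candidates = ["Hello", ",", "", " ", "\n", "word"]

for item in candidates:
    stripped = item.strip()
    # bool('') is False, so empty results are filtered out by `if item.strip()`.
    status = "kept" if stripped else "dropped"
    print(repr(item), "->", repr(stripped), status)

kept = [item for item in candidates if item.strip()]
print(kept)
```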
NOTE
When developing a simple tokenizer, whether we should encode whitespace as separate tokens or just remove it depends on our application and its requirements. Removing whitespace reduces the memory and computing requirements.

However, keeping whitespace can be useful if we train models that are **sensitive to the exact structure of the text (for example, Python code, which is sensitive to indentation and spacing)**.

Here, we remove whitespaces for simplicity and brevity of the tokenized outputs. Later, we will switch to a tokenization scheme that includes whitespaces.

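To make the trade-off in the note concrete, here is a small sketch of a whitespace-preserving variant. It is an illustration only, not the scheme used later: the pattern is an assumption that captures runs of whitespace (`\s+`) as tokens, so that joining the tokens reproduces the original text, indentation included.

```python
import re

code = "if x:\n    return 1"

# Assumed pattern: same punctuation as before, but whitespace runs are kept as tokens.
pattern = r'([,.:;?_!"()\']|--|\s+)'

# Keep whitespace tokens; drop only the empty strings re.split inserts
# between two adjacent delimiters.
tokens_with_ws = [t for t in re.split(pattern, code) if t != ""]

# Remove whitespace, as done in this section.
tokens_without_ws = [t.strip() for t in re.split(pattern, code) if t.strip()]

print(tokens_with_ws)
print(tokens_without_ws)

# With whitespace kept, the original text can be reconstructed exactly.
print("".join(tokens_with_ws) == code)
```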
In [15]:
# a more complex scenario
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


In [22]:
# apply the regex tokenizer to 'the-verdict.txt'
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(len(preprocessed)) # 20479 characters -> 4690 tokens

4690


In [23]:
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


## Building a vocabulary
<img src="mdfig\2025-04-14-01-09-31.png" width="100%" alt="description"> 

In [25]:
# converting tokens into token IDs
# ---- building a vocabulary ----

all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print('vocab_size:',vocab_size)

vocab_size: 1130


In [27]:
vocab = {token:integer for integer, token in enumerate(all_words)}
for i, item in enumerate(vocab.items()):
    print(item)
    # show the first 11 entries for illustration purpose
    if i >=10:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)


In [38]:
# ---------- tips ------------
# another way of printing a dict
print([item for item in vocab.items()][:10])
# or
print(list(vocab.items())[:10])

[('!', 0), ('"', 1), ("'", 2), ('(', 3), (')', 4), (',', 5), ('--', 6), ('.', 7), (':', 8), (';', 9)]
[('!', 0), ('"', 1), ("'", 2), ('(', 3), (')', 4), (',', 5), ('--', 6), ('.', 7), (':', 8), (';', 9)]


- We also want to convert token IDs back into words (that is, map the numbers back to text).

In [33]:
# a complete tokenizer with `encode` and `decode` methods
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        # inverse mapping: token ID -> token string
        self.int_to_str = {i:s for s, i in vocab.items()}
    
    def encode(self, text):
        # use the same split pattern (including ':' and ';') that built the vocabulary
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        # remove whitespace and empty strings
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
    
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # remove the space that join() places before punctuation
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

In [34]:
tokenizer = SimpleTokenizerV1(vocab)
text =  """"It's the last he painted, you know,"
Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


In [35]:
# turn the ids back into text
print(tokenizer.decode(ids))

" It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.


In [36]:
# attempting to tokenize words not in vocab leads to error
text = 'Hello, do you like tea?'
print(tokenizer.encode(text))

KeyError: 'Hello'

In [40]:
# adding special context tokens

all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token:integer for integer, token in enumerate(all_tokens)}

print(len(vocab.items()))

1132


In [41]:
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


In [44]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        # keep tokens that are in the vocabulary; replace all others with <|unk|>
        preprocessed = [item if item in self.str_to_int else "<|unk|>" for item in preprocessed]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
    
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        return text

In [43]:
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))
print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [47]:
tokenizer = SimpleTokenizerV2(vocab)
print(tokenizer.decode(tokenizer.encode(text)))

<|unk|> , do you like tea ? <|endoftext|> In the sunlit terraces of the <|unk|> .

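The decoded text above keeps a space before each punctuation mark because `SimpleTokenizerV2.decode` only joins the tokens with spaces. As a sketch, the same cleanup step used in `SimpleTokenizerV1.decode` could be applied afterwards. The helper name `join_tokens` is hypothetical, and the token list is copied by hand from the output above.

```python
import re

def join_tokens(tokens):
    """Join tokens with spaces, then remove the space before punctuation."""
    text = " ".join(tokens)
    # Same idea as the re.sub call in SimpleTokenizerV1.decode.
    return re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)

# Tokens corresponding to the decoded output shown above.
tokens = ["<|unk|>", ",", "do", "you", "like", "tea", "?", "<|endoftext|>",
          "In", "the", "sunlit", "terraces", "of", "the", "<|unk|>", "."]

cleaned = join_tokens(tokens)
print(cleaned)
```

The special tokens are untouched because `<`, `|`, and `>` are not in the punctuation class.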

- Using byte-pair encoding (BPE)

In [48]:
# check tiktoken version
from importlib.metadata import version
import tiktoken
print('tiktoken version:', version("tiktoken"))

tiktoken version: 0.9.0


`<|endoftext|>` is a special token in `tiktoken`, and you must allow it explicitly before using it. If your text contains a complete `<|endoftext|>` but you did not pass `allowed_special={"<|endoftext|>"}` to `tokenizer.encode`, a <span style="color: red">ValueError</span> is raised:

In [55]:
# will be downloaded to cache file data-gym-cache
# at C:\Users\xuguy\AppData\Local\Temp\data-gym-cache
tokenizer = tiktoken.get_encoding('gpt2')
tokenizer.encode('<|endoftext|>')

ValueError: Encountered text corresponding to disallowed special token '<|endoftext|>'.
If you want this text to be encoded as a special token, pass it to `allowed_special`, e.g. `allowed_special={'<|endoftext|>', ...}`.
If you want this text to be encoded as normal text, disable the check for this token by passing `disallowed_special=(enc.special_tokens_set - {'<|endoftext|>'})`.
To disable this check for all special tokens, pass `disallowed_special=()`.


In [52]:
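# note: the two adjacent string literals are concatenated without a space,
# so the encoded text contains "terracesof"
text = (
"Hello, do you like tea? <|endoftext|> In the sunlit terraces"
"of someunknownPlace."
)

# allowed_special expects a set of special-token strings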
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]

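The output above shows an unknown word such as `someunknownPlace` being split into several subword IDs instead of a single `<|unk|>`. The core idea behind byte-pair encoding is to merge the most frequent adjacent pair of symbols repeatedly. The sketch below is a toy, character-level illustration of that merge step on a made-up string. It is not how `tiktoken` is implemented, and the helper names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most frequent adjacent pair of symbols, or None."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from single characters of a made-up toy corpus.
tokens = list("low lower lowest")
for _ in range(2):  # two merge steps are enough to form the subword "low"
    pair = most_frequent_pair(tokens)
    tokens = merge_pair(tokens, pair)
    print(pair, tokens)
```

After the merges, frequent character sequences have become single symbols while rarer endings stay split into smaller pieces, which is why a BPE tokenizer can represent words it has never seen.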

In [57]:
# tokenize the whole 'the-verdict.txt'
with open("the-verdict.txt", 'r', encoding='utf-8') as f:
    raw_text = f.read()

enc_text = tokenizer.encode(raw_text)
print(len(enc_text))
print(enc_text[:100])

5145
[40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899, 2138, 257, 7026, 15632, 438, 2016, 257, 922, 5891, 1576, 438, 568, 340, 373, 645, 1049, 5975, 284, 502, 284, 3285, 326, 11, 287, 262, 6001, 286, 465, 13476, 11, 339, 550, 5710, 465, 12036, 11, 6405, 257, 5527, 27075, 11, 290, 4920, 2241, 287, 257, 4489, 64, 319, 262, 34686, 41976, 13, 357, 10915, 314, 2138, 1807, 340, 561, 423, 587, 10598, 393, 28537, 2014, 198, 198, 1, 464, 6001, 286, 465, 13476, 1, 438, 5562, 373, 644, 262, 1466, 1444, 340, 13, 314, 460, 3285, 9074, 13, 46606, 536]


In [58]:
enc_sample = enc_text[50:]

context_size = 4
x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]
print(f'x: {x}')
print(f'y:      {y}')

x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


In [59]:
# create next-word prediction tasks:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(context, '---->', desired)

[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257


In [60]:
# text version:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(tokenizer.decode(context), '---->', tokenizer.decode([desired]))

 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a


In [61]:
# create an efficient data loader that yields input (source) and target tensors
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        token_ids = tokenizer.encode(txt)

        # use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i+max_length]
            target_chunk = token_ids[i+1:i+max_length+1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))
    
    # returns the total number of rows from the dataset
    def __len__(self):
        return len(self.input_ids)
    # returns a single row from the dataset
    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

In [62]:
def create_dataloader_v1(txt, batch_size = 4, max_length = 256, stride = 128, shuffle=True, drop_last = True, num_workers = 0):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    dataloader = DataLoader(
        dataset, 
        batch_size = batch_size,
        shuffle = shuffle,
        # drop the last batch if it is shorter than batch_size
        drop_last = drop_last,
        num_workers = num_workers
    )

    return dataloader

In [70]:
# test the dataloader and inspect what its output looks like
with open('the-verdict.txt', 'r', encoding = 'utf-8') as f:
    raw_text = f.read()

dataloader = create_dataloader_v1(
    raw_text, batch_size = 1, max_length = 4, stride = 1, shuffle = False
)
data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]


In [69]:
second_batch = next(data_iter)
print(second_batch)

[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]


In [71]:
# batch size greater than 1
# increase the stride to 4 to utilize the data set fully
# stride = max_length ensures that consecutive input windows do not overlap
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride = 4, shuffle=False)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print(f'inputs: {inputs}')
print(f'targets: {targets}')

inputs: tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])
targets: tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])

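The effect of `stride` can be checked without PyTorch. The sketch below reuses the same `range(0, len(token_ids) - max_length, stride)` loop as `GPTDatasetV1`, applied to stand-in token IDs 0..9 rather than real tokenizer output. The helper name `window_starts` is hypothetical.

```python
def window_starts(num_tokens, max_length, stride):
    """Start indices produced by the sliding-window loop in GPTDatasetV1."""
    return list(range(0, num_tokens - max_length, stride))

token_ids = list(range(10))  # stand-in token IDs, not real tokenizer output
max_length = 4

for stride in (1, 2, 4):
    pairs = [
        (token_ids[i:i + max_length], token_ids[i + 1:i + max_length + 1])
        for i in window_starts(len(token_ids), max_length, stride)
    ]
    first_inputs = [inputs for inputs, _ in pairs[:2]]
    print(f"stride={stride}: {len(pairs)} samples, first inputs: {first_inputs}")
```

With `stride=1` consecutive inputs overlap in all but one token. With `stride=max_length` each token appears in at most one input window, at the cost of fewer samples.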

In [72]:
# word embedding (turning ids into continuous-valued vectors)
# --- start with a simplified example ---
# embedding size = output_dim: the dimension of embedding space
input_ids = torch.tensor([2, 3, 5, 1])

vocab_size = 6
output_dim = 3

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


In [73]:
# converting ids to word embeddings
# number of rows = sequence length
print(embedding_layer(input_ids))

tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)

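The output above is simply rows 2, 3, 5, and 1 of the weight matrix. The following plain-Python sketch, using the weight values printed earlier, illustrates that an embedding lookup is a row selection and that it gives the same result as multiplying a one-hot vector by the weight matrix. The `one_hot` helper is introduced here only for illustration.

```python
# Weight matrix copied from the printed embedding_layer.weight above.
weight = [
    [0.3374, -0.1778, -0.1690],
    [0.9178, 1.5810, 1.3010],
    [1.2753, -0.2010, -0.1606],
    [-0.4015, 0.9666, -1.1481],
    [-1.1589, 0.3255, -0.6315],
    [-2.8400, -0.7849, -1.4096],
]
input_ids = [2, 3, 5, 1]

# Embedding lookup: pick the row whose index equals the token ID.
looked_up = [weight[i] for i in input_ids]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at `index`."""
    return [1.0 if j == index else 0.0 for j in range(size)]

# Equivalent view: one-hot row vector times the weight matrix.
via_one_hot = [
    [sum(h * w for h, w in zip(one_hot(i, len(weight)), column))
     for column in zip(*weight)]
    for i in input_ids
]

for row in looked_up:
    print(row)
print(via_one_hot == looked_up)
```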

In [74]:
# adding position information to embedding:
vocab_size = 50257
output_dim = 256 # := embedding size
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)


In [75]:
max_length = 4
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length, stride = max_length, shuffle = False
)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print('Token IDs:\n', inputs)
print('\nInputs shape:\n', inputs.shape)

Token IDs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Inputs shape:
 torch.Size([8, 4])


In [76]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)
# each token is now embedded as a 256-dimensional vector

torch.Size([8, 4, 256])


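- When `nn.Embedding` maps position indices (0, 1, 2, ...) to vectors, it treats each position as an independent "symbol". This is exactly how word embeddings handle words: discrete symbols are represented by dense vectors. The initial values are random, but the model learns the implicit relationships between positions from the training data.
- Using scalar values such as 1, 2, 3 directly would cause two problems:
  - The network would interpret the absolute magnitude as meaningful (but position 3 is not "three times as large" as position 1).
  - Relative position relationships are hard to express (the distance between positions 1 and 2 equals the distance between positions 2 and 3, which may not match how language behaves).

  With learnable vector representations, the model can discover a better encoding of positional relationships on its own. For example, adjacent positions may end up related along a particular direction in the vector space.
- With `output_dim`-dimensional vectors (for example, 512 dimensions), the model has enough room in a high-dimensional space to encode complex positional patterns. This is more expressive than a single scalar and can represent several positional features at once (such as absolute position, relative position, or odd/even position).
- The position embedding is also a lookup table. As the model trains, this table gradually evolves into an encoding matrix that carries positional information.
- Random initialization combined with learnable parameters lets the model discover a suitable position-encoding strategy from the data itself, which is more flexible and expressive than a hand-designed fixed encoding scheme.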

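For contrast with the learned lookup table, the sketch below builds one well-known hand-designed fixed scheme, the sinusoidal encoding from the original Transformer paper. It is shown only to illustrate what "fixed" means here: it is not used elsewhere in this notebook, and the function name is hypothetical.

```python
import math

def sinusoidal_position_encoding(context_length, output_dim):
    """Fixed (non-learned) table: sin on even dimensions, cos on odd ones."""
    table = []
    for pos in range(context_length):
        row = []
        for k in range(output_dim):
            # Dimensions 2i and 2i+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (k // 2)) / output_dim))
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

table = sinusoidal_position_encoding(context_length=4, output_dim=8)
print(len(table), len(table[0]))
print([round(v, 4) for v in table[1]])
```

Unlike an `nn.Embedding` table, these values are computed once from a formula and never updated by training.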

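### Why does random initialization still work?
Although the initial vectors are random, backpropagation adjusts them during training, so they can end up with properties such as:

- vectors of adjacent positions becoming more similar
- positions at particular intervals forming regular patterns, so that the vector space acquires a meaningful geometric structure

  For example, after training we might find that the vector for position i ≈ the vector for position i-1 + some fixed direction vector. Such an implicit linear relationship would be a position-encoding strategy the model has learned on its own.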

- https://yuanbao.tencent.com/chat/naQivTmsDa/7a2b50e1-63d7-42e1-ba3e-7b64d857b2cf

In [77]:
# positional embedding layer:
# context_length is a variable that represents the supported input size of the LLM
context_length = max_length

pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))
print(pos_embeddings.shape)

torch.Size([4, 256])


In [82]:
# the final step: add the positional embeddings to the token embeddings
# (pos_embeddings of shape [4, 256] is broadcast across the batch dimension)
# the result, input_embeddings, is what the LLM actually processes
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

torch.Size([8, 4, 256])

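The addition works even though the two tensors have different shapes (`[8, 4, 256]` and `[4, 256]`) because the positional embeddings are broadcast over the batch dimension. The plain-Python sketch below spells out that broadcast on tiny stand-in shapes (2 sequences, 3 positions, 2 dimensions) with made-up values. It illustrates the arithmetic only, not how PyTorch implements it.

```python
# Tiny stand-in shapes instead of [8, 4, 256].
batch_size, context_length, output_dim = 2, 3, 2

# Made-up values: each entry encodes its own (batch, position, dim) index.
token_embeddings = [
    [[float(100 * b + 10 * t + d) for d in range(output_dim)]
     for t in range(context_length)]
    for b in range(batch_size)
]
# One vector per position, shared by every sequence in the batch.
pos_embeddings = [[0.5 * (t + 1)] * output_dim for t in range(context_length)]

# Broadcasting: pos_embeddings[t] is added to position t of every sequence.
input_embeddings = [
    [[tok + pos for tok, pos in zip(token_embeddings[b][t], pos_embeddings[t])]
     for t in range(context_length)]
    for b in range(batch_size)
]

shape = (len(input_embeddings), len(input_embeddings[0]), len(input_embeddings[0][0]))
print(shape)
print(input_embeddings[1][2])
```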