<a href="https://colab.research.google.com/github/waitkeeper/aistudy/blob/main/LLMs_from_scratch_CN-02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Learning resources

Bilibili video
https://www.bilibili.com/video/BV16AKAzzECq/?spm_id_from=333.1387.favlist.content.click&vd_source=795bf9ea159d202907a8da08d96e68b8

Chinese-language documentation
https://skindhu.github.io/Build-A-Large-Language-Model-CN/#/./cn-Book/1.%E7%90%86%E8%A7%A3%E5%A4%A7%E8%AF%AD%E8%A8%80%E6%A8%A1%E5%9E%8B


In [None]:

!pip install torch
!pip install tiktoken




# 1. Data preparation


In [None]:
import requests
import os

url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/refs/heads/main/ch02/01_main-chapter-code/the-verdict.txt"
filename = "the-verdict.txt"

if not os.path.exists(filename):
    response = requests.get(url)
    response.raise_for_status() # Raise an exception for bad status codes

    with open(filename, "wb") as f:
        f.write(response.content)

    print(f"File '{filename}' downloaded successfully.")
else:
    print(f"File '{filename}' already exists.")

# Listing 2.1 Reading in a short story as text sample into Python
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
print("Total number of character:", len(raw_text))
print(raw_text[:99])

File 'the-verdict.txt' downloaded successfully.
Total number of character: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


# 2. Tokenizing text

In [None]:
# Custom test data
import re
text_ori = "Hello, world. This, is a test."
result01 = re.split(r'(\s)', text_ori)
print(result01)
result02 = re.split(r'([,.]|\s)', text_ori)
print(result02)
result02_1 = [item for item in result02 if item.strip()]
print(result02_1)
#

['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']
['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']
['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']


In [None]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(len(preprocessed))

4690


# Converting tokens to token IDs

In [None]:
# set() removes duplicate elements
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(vocab_size)
# This syntax is a Python dict comprehension
vocab = {token:integer for integer,token in enumerate(all_words)}
for i, item in enumerate(vocab.items()):
    print(item)
    if i > 10:
        break


1130
('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)


In [None]:
# Map tokens to token IDs; token IDs also need to map back to tokens
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {integer: token for token, integer in vocab.items()}

    def encode(self, text):
        # Use the same split pattern as the preprocessing step, including ':' and ';'
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[token] for token in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the space that join() inserted before punctuation
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text


In [None]:
# Test the class above
tokenizer = SimpleTokenizerV1(vocab)
text = """"It's the last he painted, you know," Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)
tokens = tokenizer.decode(ids)
print(tokens)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]
" It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.


In [None]:
# Problem
text = "Hello, do you like tea?"
print(tokenizer.encode(text))
# This raises a KeyError because the word "Hello" does not appear in our text, so it is not in the vocabulary

KeyError: 'Hello'
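
One way to see this failure coming is to check a text for out-of-vocabulary tokens before encoding it. The sketch below is self-contained and only illustrative: `find_unknown_tokens` and `toy_vocab` are hypothetical names introduced here, and the toy vocabulary stands in for the real one built from the-verdict.txt. The split pattern is the one used in the preprocessing cell.

```python
import re

# Same split pattern as the preprocessing cell above.
SPLIT_PATTERN = r'([,.:;?_!"()\']|--|\s)'

def find_unknown_tokens(text, vocab):
    """Return the tokens of `text` that are missing from `vocab`, in order."""
    tokens = [t.strip() for t in re.split(SPLIT_PATTERN, text) if t.strip()]
    return [t for t in tokens if t not in vocab]

# Hypothetical toy vocabulary; the real one is built from the-verdict.txt.
toy_vocab = {token: i for i, token in enumerate([",", "?", "do", "like", "tea", "you"])}

missing = find_unknown_tokens("Hello, do you like tea?", toy_vocab)
print(missing)  # ['Hello']
```

Any token reported here would trigger the `KeyError` in `SimpleTokenizerV1.encode`.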

In [None]:
allTokens = sorted(list(set(preprocessed)))
allTokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token:integer for integer,token in enumerate(allTokens)}
print(len(vocab.items()))

In [None]:
# Second version of the tokenizer: unknown words map to <|unk|>
class SimpleTokenizerV2:
    def __init__(self, vocab) -> None:
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        # Use the same split pattern as the preprocessing step, including ':' and ';'
        pre_tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        pre_tokens = [item.strip() for item in pre_tokens if item.strip()]
        # Replace tokens that are not in the vocabulary with <|unk|>
        pre_tokens = [item if item in self.str_to_int else "<|unk|>" for item in pre_tokens]
        ids = [self.str_to_int[s] for s in pre_tokens]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the space that join() inserted before punctuation
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

In [None]:
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))
print(text)

In [None]:
tokenizerV2 = SimpleTokenizerV2(vocab)
resIds = tokenizerV2.encode(text)
resText = tokenizerV2.decode(resIds)
print(resText)


# Byte pair encoding

In [None]:
# pip install tiktoken

from importlib.metadata import version
import tiktoken

print("tiktoken version:", version("tiktoken"))
text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace."
tokenizer01 = tiktoken.get_encoding('gpt2')
integer01 = tokenizer01.encode(text,allowed_special={"<|endoftext|>"})
print(integer01) # Note that <|endoftext|> is assigned a large ID, 50256, the largest token ID in the GPT-2 vocabulary
recoverText = tokenizer01.decode(integer01)
print(recoverText)
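
tiktoken hides how a byte pair encoding vocabulary is built. As a rough illustration only, the sketch below performs a single merge step on characters: it counts adjacent symbol pairs and fuses the most frequent pair into one new symbol. The function name `merge_most_frequent_pair` and the sample string are made up for this example. The real GPT-2 tokenizer works on bytes with a pre-trained merge table and is considerably more involved.

```python
from collections import Counter

def merge_most_frequent_pair(symbols):
    """Fuse every non-overlapping occurrence of the most frequent adjacent pair."""
    pair_counts = Counter(zip(symbols, symbols[1:]))
    if not pair_counts:
        return None, list(symbols)
    best_pair, _ = pair_counts.most_common(1)[0]

    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best_pair:
            merged.append(symbols[i] + symbols[i + 1])  # fuse the pair
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return best_pair, merged

symbols = list("aaabdaaabac")
best_pair, merged = merge_most_frequent_pair(symbols)
print(best_pair)  # ('a', 'a')
print(merged)     # ['aa', 'a', 'b', 'd', 'aa', 'a', 'b', 'a', 'c']
```

Repeating this step grows a vocabulary of progressively longer subword units, which is why an unseen word such as `someunknownPlace` can still be encoded as a sequence of smaller known pieces.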

# Data sampling with a sliding window

In [None]:
with open('the-verdict.txt', 'r', encoding='utf-8') as f:
    raw_text = f.read()

enc_text = tokenizer01.encode(raw_text)
print(len(enc_text))

In [None]:
context_size = 4 # The context size determines how many tokens are in the input

enc_sample = enc_text[:50]
x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]
print(f"x:{x}")
print(f"y:    {y}")

for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(context, "--->", desired)
    print(tokenizer01.decode(context),"--->", tokenizer01.decode([desired]))

In [None]:
# Implement the dataset class and data loader

import torch
from torch.utils.data import Dataset,DataLoader
import tiktoken

class GPTDatasetV1(Dataset):
    def __init__(self,txt,tokenizer,max_length,stride) -> None:
        self.input_ids = []
        self.target_ids = []
        token_ids = tokenizer.encode(txt)
        # Use a sliding window to split the book into overlapping sequences of length max_length
        for i in range(0,len(token_ids)-max_length,stride):
            input_chunk = token_ids[i:i+max_length]
            target_chunk = token_ids[i+1:i+max_length+1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))
    def __len__(self):
        return len(self.input_ids)
    def __getitem__(self, index):
        return self.input_ids[index],self.target_ids[index]

# Use the GPTDatasetV1 class defined above to load the inputs in batches through a PyTorch DataLoader
def create_dataloader_v1(txt,batch_size=4, max_length=256,
                         stride=128, shuffle=True,drop_last=True,num_workers=0):
    tokenizer = tiktoken.get_encoding('gpt2')
    dataset = GPTDatasetV1(txt,tokenizer,max_length,stride)
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )
    return dataloader


In [None]:
# Test the data loader above
with open('the-verdict.txt', "r", encoding='utf-8') as f:
    raw_text = f.read()
dataloader = create_dataloader_v1(raw_text,batch_size=1,max_length=4,stride=1,shuffle=False)
data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)
second_batch = next(data_iter)
print(second_batch)
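
To see how `max_length` and `stride` shape the samples without running the tokenizer, the sketch below applies the same windowing loop to a plain list of integers standing in for token IDs. `sliding_windows` and `fake_ids` are hypothetical names used only for this illustration; they are not part of the notebook's loader.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one token."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i:i + max_length]
        targets = token_ids[i + 1:i + max_length + 1]
        pairs.append((inputs, targets))
    return pairs

fake_ids = list(range(10))  # stand-in for real token IDs

overlapping = sliding_windows(fake_ids, max_length=4, stride=1)
disjoint = sliding_windows(fake_ids, max_length=4, stride=4)

print(overlapping[0])    # ([0, 1, 2, 3], [1, 2, 3, 4])
print(overlapping[1])    # ([1, 2, 3, 4], [2, 3, 4, 5])
print(len(overlapping))  # 6
print(len(disjoint))     # 2
```

With `stride=1`, consecutive samples share all but one token. With `stride` equal to `max_length`, the inputs do not overlap at all, which yields fewer samples from the same text.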

In [None]:
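# The batch size above was 1, mainly to illustrate how the loader works.
# In deep learning, small batches use less memory during training but make the model updates noisier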

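# Building the token embedding layer

The last step in preparing the training data for an LLM is to convert the token IDs into embedding vectors.
We first initialize the embedding weights with random values. These weights are later optimized as part of LLM training.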

In [None]:
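# An example to illustrate how token IDs are converted into embedding vectors

# Suppose we have 4 input tokens
input_ids = torch.tensor([2,3,5,1])
# Suppose the vocabulary contains only 6 words
vocab_size = 6
# Suppose the embedding dimension is 3 (in GPT-3, the embedding size is 12,288 dimensions)
output_dim = 3

# Instantiate an embedding layer in PyTorch using vocab_size and output_dim
# Random seed 123 makes the result reproducible
torch.manual_seed(123)

embedding_layer = torch.nn.Embedding(vocab_size,output_dim)
print(embedding_layer.weight)
# Each row of the embedding matrix represents one token in the vocabulary, so every token ID has its own vector
# Each column of the embedding matrix represents one dimension of the embedding space; this example has 3 dimensions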

In [None]:
# With the embedding layer in place, we can look up the embedding vector for a given token ID
print(embedding_layer(torch.tensor([3])))
# In PyTorch, calling an nn.Module instance like a function (e.g. embedding_layer(input_tensor))
# invokes the module's forward method.
# For nn.Embedding, forward takes a tensor of token IDs as input,
# looks up the embedding vector for each token ID, and returns those vectors as a tensor.

In [None]:
# The case of 4 input token IDs
print(embedding_layer(torch.tensor([2,3,5,1])))
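
The lookup described above can be checked directly. The sketch below rebuilds the same small layer and shows that calling it returns the same vectors as indexing rows of its weight matrix, and that both match multiplying one-hot vectors by the weight matrix. The variable names `looked_up`, `indexed`, and `via_matmul` are introduced here for illustration only.

```python
import torch

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(6, 3)  # 6 tokens, 3 dimensions, as above

input_ids = torch.tensor([2, 3, 5, 1])
looked_up = embedding_layer(input_ids)

# Row indexing into the weight matrix gives the same vectors.
indexed = embedding_layer.weight[input_ids]
print(torch.equal(looked_up, indexed))  # True

# Equivalent but wasteful view: one-hot vectors times the weight matrix.
one_hot = torch.nn.functional.one_hot(input_ids, num_classes=6).float()
via_matmul = one_hot @ embedding_layer.weight
print(torch.allclose(looked_up, via_matmul))  # True
```

The embedding layer is therefore a more efficient way of doing the one-hot multiplication, which is why its weights can be trained with backpropagation like any linear layer.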

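# Positional encoding

In the previous section we converted token IDs into continuous vector representations, the token embeddings.

In terms of format, these are suitable as input to an LLM. However, a minor shortcoming of LLMs is that their self-attention mechanism has no notion of the position or order of the tokens in a sequence.

The embedding layer introduced above maps a token ID to the same embedding vector regardless of where the token appears in the sequence.

The solution is positional embeddings.

There are two kinds of positional embeddings:
1. Absolute positional embeddings
2. Relative positional embeddings

Absolute positional embeddings are tied to specific positions in the sequence, whereas relative positional embeddings emphasize the relative position or distance between tokens.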



In [None]:
# Resize the embedding layer to a more realistic size

vocab_size = 50257
output_dim = 256
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

In [None]:
max_length = 4
dataloader = create_dataloader_v1(
      raw_text, batch_size=8, max_length=max_length, stride=max_length, shuffle=False)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Token IDs:", inputs)
print("Inputs shape:", inputs.shape)

In [None]:
# Use the embedding layer to convert these token IDs into 256-dimensional vectors
token_embedding = token_embedding_layer(inputs)
print(token_embedding.shape)

In [None]:
# GPT models use absolute positional embeddings, so we only need to create another embedding layer
# with the same embedding dimension as token_embedding_layer

context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length,output_dim)
pos_embedding = pos_embedding_layer(torch.arange(context_length))
print(pos_embedding.shape)


In [None]:
input_embedding = token_embedding + pos_embedding
print(input_embedding)
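
The addition above relies on broadcasting: a token embedding tensor of shape `[batch, context_length, output_dim]` is added to a positional embedding tensor of shape `[context_length, output_dim]`. The sketch below is a self-contained illustration that uses a hand-written batch (`fake_inputs`, a hypothetical stand-in for the data loader output) in which one sequence repeats the same token ID, so the effect of the positional embeddings is visible.

```python
import torch

torch.manual_seed(123)
vocab_size, output_dim, context_length = 50257, 256, 4

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

# Hypothetical batch: 2 sequences; the first repeats token ID 7 at every position.
fake_inputs = torch.tensor([[7, 7, 7, 7],
                            [10, 20, 30, 40]])

token_embeddings = token_embedding_layer(fake_inputs)               # shape [2, 4, 256]
pos_embeddings = pos_embedding_layer(torch.arange(context_length))  # shape [4, 256]

# [2, 4, 256] + [4, 256]: the positional vectors are broadcast over the batch.
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)  # torch.Size([2, 4, 256])

# Token embeddings alone cannot tell the repeated token's positions apart...
print(torch.equal(token_embeddings[0, 0], token_embeddings[0, 1]))  # True
# ...but after adding positional embeddings the two positions differ.
print(torch.equal(input_embeddings[0, 0], input_embeddings[0, 1]))  # False
```

Both embedding layers start from random weights here, so the positional vectors carry no meaning yet; they become useful only once they are optimized during training.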