# 1 Understanding Large Language Models (LLMs)

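> The overall pipeline is as follows:
> A base (foundation) model is pretrained on raw data (the concept of meta-learning is worth a look here). The base model's core abilities are text completion and reasoning on short tasks.<br/>
> On top of the base model, you can train further on your own labeled data. This step is called fine-tuning, and it yields your own LLM for tasks such as classification, summarization, translation, and personal assistants.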

![1716275709784](image/从零开始构建LLM/1716275709784.png)

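> **Transformer** architecture overview
> 1. Input the text to be translated
> 2. Preprocess the text
> 3. The encoder encodes the input text
> 4. The encoded representation is passed to the decoder
> 5. The model completes the translation one word at a time
> 6. Preprocess the text (the partial translation generated so far)
> 7. The decoder generates one word
> 8. The translation is complete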

![1716275687724](image/从零开始构建LLM/1716275687724.png)

> BERT vs. GPT: BERT is mainly used to fill in masked words in a text, whereas GPT predicts the next word.

![1716275758151](image/从零开始构建LLM/1716275758151.png)
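
To make the difference concrete, the sketch below builds both kinds of training examples from one sentence in plain Python. It is a simplification: the helper names `masked_examples` and `next_word_examples` and the `[MASK]` placeholder string are assumptions made for illustration, and real masked-language-model training masks a random subset of tokens rather than one word at a time.

```python
def masked_examples(words, mask_token="[MASK]"):
    """BERT-style: hide one word; the model sees the context on both sides."""
    examples = []
    for i, word in enumerate(words):
        masked = words[:i] + [mask_token] + words[i + 1:]
        examples.append((" ".join(masked), word))
    return examples


def next_word_examples(words):
    """GPT-style: the model sees only the left context and predicts the next word."""
    return [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]


sentence = "this is an example".split()

for text, target in masked_examples(sentence):
    print(f"masked: {text!r} -> {target!r}")
for text, target in next_word_examples(sentence):
    print(f"next:   {text!r} -> {target!r}")
```

The masked examples always contain words to the right of the gap, while the next-word examples never do. That is the practical difference between the two objectives.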

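> **Steps for building an LLM**<br/>

|Stage|Step|
|---|---|
|1|Data preparation and sampling|
||Implement the attention mechanism|
||Implement the LLM architecture|
|2|Training|
||Model evaluation|
||Load pretrained model weights|
|3|Fine-tune your own model|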

![1716275818354](image/从零开始构建LLM/1716275818354.png)

# 2 Working with Text Data

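## 2.1 Word embeddings
The basic purpose of word embeddings is to **convert non-numeric data into vectors** so that a computer can operate on it. A well-known word embedding method is **Word2Vec**. The GPT architecture does not use such a pretrained embedding. Instead, the embedding layer is part of the model and is adjusted continuously during training. In other words, **GPT trains the word embeddings as well**. In GPT-3, the embedding size reaches 12,288 dimensions.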

![1716433691383](image/从零开始构建LLM/1716433691383.png)

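## 2.2 Tokenizing text
Tokenizing means splitting the text into individual words and punctuation marks, then mapping each one to a unique ID. A dictionary can serve as the vocabulary: it maps every token to a token ID, and the token IDs are then used for the embedding lookup.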

In [1]:
import os
import re

In [11]:
filepath = os.path.join('data', 'the-verdict.txt')
assert os.path.exists(filepath), f"{filepath} does not exist."
with open(filepath, encoding="utf-8") as f:
    raw_text = f.read()
print(">> Total number of character:", len(raw_text))
print(">> raw text:", raw_text[:100])
print()

# split raw text
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]  # remove empty string
print(">> preprocessed:", preprocessed[:30])
print(">> length:", len(preprocessed))
print()

# remove duplicate words
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(">> size of vocab after removed duplicate words:", vocab_size)

# create vocab
vocab = {token:integer for integer,token in enumerate(all_words)}
print(">> vocab: front 20 items")
for tok, i in vocab.items():
    if i > 20:
        break
    print(tok, i)

>> Total number of character: 20479
>> raw text: I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no g

>> preprocessed: ['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']
>> length: 4690

>> size of vocab after removed duplicate words: 1130
>> vocab: front 20 items
! 0
" 1
' 2
( 3
) 4
, 5
-- 6
. 7
: 8
; 9
? 10
A 11
Ah 12
Among 13
And 14
Are 15
Arrt 16
As 17
At 18
Be 19
Begin 20


![1716433945337](image/从零开始构建LLM/1716433945337.png)

You can build the vocabulary yourself. Once it exists, it lets you convert between text and token IDs in both directions.

In [15]:
class SimpleTokenizerV1:
    def __init__(self, vocab):  # our vocab
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}  # reverse k, v
    
    def encode(self, text):  # our text
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)  # same pattern that built the vocab
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()  # remove empty string
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
        
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

tokenizer = SimpleTokenizerV1(vocab)

text = """"It's the last he painted, you know," 
           Mrs. Gisburn said with pardonable pride."""
print(">> original text: ", text)

ids = tokenizer.encode(text)
print(">> encoded data:", ids)

decoded_text = tokenizer.decode(ids)
print(">> decoded data:", decoded_text)

>> original text:  "It's the last he painted, you know," 
           Mrs. Gisburn said with pardonable pride.
>> encoded data: [1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]
>> decoded data: " It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.


![1716434053588](image/从零开始构建LLM/1716434053588.png)

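## 2.3 Special tokens
As in any data-preprocessing pipeline, unusual input needs attention. When the vocabulary above does not cover everything, tokens that are missing from it need special handling, and separate sentences or documents also need a separator between them.<br/>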
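
A small sketch of why this matters: with a plain dictionary lookup, any word outside the vocabulary raises a `KeyError`, while falling back to an `<|unk|>` entry keeps encoding working. The tiny `toy_vocab` below is made up for illustration and is not the vocabulary built from the text file.

```python
# Toy vocabulary, invented for this example only.
toy_vocab = {"do": 0, "you": 1, "like": 2, "tea": 3, "?": 4, "<|unk|>": 5}

words = ["do", "you", "like", "coffee", "?"]  # "coffee" is not in toy_vocab

# Strict lookup: fails on the first unknown word.
try:
    strict_ids = [toy_vocab[w] for w in words]
except KeyError as err:
    strict_ids = None
    print("unknown token:", err)

# Lookup with a fallback ID for unknown words.
unk_id = toy_vocab["<|unk|>"]
safe_ids = [toy_vocab.get(w, unk_id) for w in words]
print(safe_ids)  # [0, 1, 2, 5, 4]
```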

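**Adding special tokens for unknown words** and for text boundaries is very useful:

* Special tokens give the LLM additional context.
* Note: some common special tokens are<br/>
    1. `[BOS]` (beginning of sequence): marks the start of a text<br/>
    2. `[EOS]` (end of sequence): marks the end of a text<br/>
    3. `[PAD]` (padding): pads training texts to a uniform length<br/>
    4. `[UNK]` (unknown): stands for a token that is not in the vocabulary<br/>
* GPT-2 uses only `<|endoftext|>` to reduce complexity. `<|endoftext|>` is used much like `[EOS]`, and GPT-2 also uses it for padding.
* For unknown words, GPT-2 does not substitute `[UNK]`. It uses byte-pair encoding (BPE) to break the word into smaller units.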
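
As a rough sketch of the `[PAD]` idea, the snippet below right-pads a batch of token-ID lists to a common length. The helper `pad_batch` is hypothetical. The pad ID 50256 is the ID of `<|endoftext|>` in the GPT-2 BPE vocabulary, and the other IDs are only sample values.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence with pad_id up to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


batch = [[15496, 11, 466], [554, 262], [13]]  # three sequences of different length
padded = pad_batch(batch, pad_id=50256)
for row in padded:
    print(row)
```

During training, padded positions are typically masked out so that they do not contribute to the loss.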

Building on the V1 tokenizer, we therefore extend the vocabulary with an unknown-token entry and a separator token:

`all_tokens.extend(["<|endoftext|>", "<|unk|>"])`

![1716455910828](image/从零开始构建LLM/1716455910828.png)

In [24]:
# preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)  # pre version
preprocessed = re.split(r'([,.?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]

all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {token:integer for integer,token in enumerate(all_tokens)}

vocab_size = len(vocab)
print(">> size of vocab after removed duplicate words:", vocab_size)

print(">> vocab: last 5 items")
for i, tok in enumerate(list(vocab.items())[-5:]):
    print(tok)

>> size of vocab after removed duplicate words: 1161
>> vocab: last 5 items
('younger', 1156)
('your', 1157)
('yourself', 1158)
('<|endoftext|>', 1159)
('<|unk|>', 1160)


In [21]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = { i:s for s,i in vocab.items()}
    
    def encode(self, text):
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            item if item in self.str_to_int 
            else "<|unk|>" for item in preprocessed
        ]

        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
        
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text


tokenizer = SimpleTokenizerV2(vocab)

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

text = " <|endoftext|> ".join((text1, text2))

print(">> input text:", text)

ids = tokenizer.encode(text)
print(">> encoded data:", ids)

decoded_text = tokenizer.decode(ids)
print(">> decoded data:", decoded_text)

>> input text: Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.
>> encoded data: [1160, 5, 362, 1155, 642, 1000, 10, 1159, 57, 1013, 981, 1009, 738, 1013, 1160, 7]
>> decoded data: <|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.


## 2.4 Byte-pair encoding

`pip install tiktoken`


In [26]:
%pip install tiktoken
import tiktoken



In [30]:
tokenizer = tiktoken.get_encoding("gpt2")

text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace."

ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(">> encoded data:", ids)

decoded_text = tokenizer.decode(ids)
print(">> decoded data:", decoded_text)

>> encoded data: [15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271, 13]
>> decoded data: Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.


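> BPE breaks an unknown word into smaller subword units (down to single characters if necessary), so every word can still be encoded.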

![1716778401378](image/从零开始构建LLM/1716778401378.png)
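
To illustrate how such subword units come about, here is a minimal character-level BPE sketch. It is not the GPT-2 tokenizer, which works on bytes and uses a large pretrained merge table. The toy corpus and the function names `merge_symbols`, `learn_merges`, and `encode_word` are assumptions made for this example.

```python
from collections import Counter


def merge_symbols(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_merges(word_freqs, num_merges):
    """Learn merge rules by repeatedly merging the most frequent adjacent pair."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # on ties, the first pair seen wins
        merges.append(best)
        words = {tuple(merge_symbols(list(s), best)): f for s, f in words.items()}
    return merges


def encode_word(word, merges):
    """Split a word into characters, then apply the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_symbols(symbols, pair)
    return symbols


toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}  # word -> count
merges = learn_merges(toy_corpus, num_merges=3)
print(merges)                         # [('e', 's'), ('es', 't'), ('l', 'o')]
print(encode_word("lowest", merges))  # ['lo', 'w', 'est']
```

The word "lowest" never appears in the toy corpus, yet it is encoded from pieces learned from other words. No `<|unk|>` token is needed.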

## 2.5 Data sampling with a sliding window

![1716778580547](image/从零开始构建LLM/1716778580547.png)

In [37]:
enc_text = tokenizer.encode(raw_text)
enc_sample = enc_text[50:]

context_size = 4
x = enc_sample[:context_size]
y = enc_sample[1:context_size + 1]
print(f">> x: {x}")
print(f">> y: {y}")
print()

print(">> tokenizer encode in one context:")
for i in range(1, context_size + 1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(f">> {context} --> {desired}")
print()

print(">> tokenizer decode in one context:")
for i in range(1, context_size + 1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(f">> {tokenizer.decode(context)} --> {tokenizer.decode([desired])}")

>> x: [290, 4920, 2241, 287]
>> y: [4920, 2241, 287, 257]

>> tokenizer encode in one context:
>> [290] --> 4920
>> [290, 4920] --> 2241
>> [290, 4920, 2241] --> 287
>> [290, 4920, 2241, 287] --> 257

>> tokenizer decode in one context:
>>  and -->  established
>>  and established -->  himself
>>  and established himself -->  in
>>  and established himself in -->  a


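So the two things we care about are the input vector and the target vector, where the target is the input shifted by one token.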

![1716778596288](image/从零开始构建LLM/1716778596288.png)

In [38]:
%pip install torch



In [39]:
import torch
from torch.utils.data import Dataset, DataLoader


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


def create_dataloader_v1(txt, batch_size=4, max_length=256, 
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers  # pass the argument through instead of hard-coding 0
    )

    return dataloader

In [43]:
# modify: batch_size, max_length, stride
# will get different data
dataloader = create_dataloader_v1(
    raw_text, batch_size=2, max_length=4, stride=2, shuffle=False
)

data_iter = iter(dataloader)
data = next(data_iter)
print(f">> {data}")

>> [tensor([[  40,  367, 2885, 1464],
        [2885, 1464, 1807, 3619]]), tensor([[ 367, 2885, 1464, 1807],
        [1464, 1807, 3619,  402]])]
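
To see how `stride` changes the samples without loading the dataset, the sketch below repeats the windowing loop from `GPTDatasetV1` on plain Python lists. The function name `sliding_windows` and the stand-in token IDs `0..9` are assumptions for illustration.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted right by one token."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i:i + max_length]
        targets = token_ids[i + 1:i + max_length + 1]
        pairs.append((inputs, targets))
    return pairs


token_ids = list(range(10))  # stand-in for real token IDs

for stride in (1, 4):
    print(f"stride={stride}")
    for inputs, targets in sliding_windows(token_ids, max_length=4, stride=stride):
        print("  ", inputs, "->", targets)
```

With `stride=1` consecutive windows overlap in all but one token. With `stride` equal to `max_length` the inputs do not overlap at all, which gives fewer but less redundant samples.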


## 2.6 Creating token embeddings

This step converts token IDs into embedding vectors.

![1716778710812](image/从零开始构建LLM/1716778710812.png)

In [45]:
# Simple Example
input_ids = torch.tensor([2, 3, 5, 1])

vocab_size = 6
output_dim = 3

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(f">> {embedding_layer.weight}")

print(f">> {embedding_layer(input_ids)}")

>> Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)
>> tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)
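
The output above suggests that an embedding layer is a lookup table: each token ID selects one row of the weight matrix. The sketch below checks this with plain Python lists, using the weight values printed above, and shows that multiplying a one-hot vector by the weight matrix gives the same row. The helpers `one_hot` and `vec_mat` are written here for illustration only.

```python
# Weight matrix copied from the printed embedding_layer.weight (6 tokens x 3 dims).
weight = [
    [0.3374, -0.1778, -0.1690],
    [0.9178, 1.5810, 1.3010],
    [1.2753, -0.2010, -0.1606],
    [-0.4015, 0.9666, -1.1481],
    [-1.1589, 0.3255, -0.6315],
    [-2.8400, -0.7849, -1.4096],
]
input_ids = [2, 3, 5, 1]

# 1) Embedding lookup: pick the row whose index equals the token ID.
looked_up = [weight[token_id] for token_id in input_ids]


# 2) The same result via one-hot vectors multiplied with the weight matrix.
def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def vec_mat(vec, mat):
    return [sum(v * row[col] for v, row in zip(vec, mat)) for col in range(len(mat[0]))]


via_one_hot = [vec_mat(one_hot(t, len(weight)), weight) for t in input_ids]

for row in looked_up:
    print(row)
```

The direct lookup avoids building the mostly-zero one-hot vectors, which is why it is preferred over an explicit matrix multiplication.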


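## 2.7 Encoding positions
The embedding layer maps a given token ID to the same vector no matter where the token appears in the sequence, as the figure below shows: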

![1716778847577](image/从零开始构建LLM/1716778847577.png)

To address this, positional embeddings are added, so that the same token at different positions gets a different input vector.

![1716778935403](image/从零开始构建LLM/1716778935403.png)
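
A minimal sketch of the effect, using made-up two-dimensional embeddings in plain Python. All numbers and the token IDs 7 and 9 are invented for this example.

```python
# Toy embeddings, invented for this example (dimension 2).
token_embedding = {7: [0.5, -1.0], 9: [1.5, 0.25]}          # token ID -> vector
position_embedding = [[0.1, 0.0], [0.0, 0.2], [-0.3, 0.3]]  # position -> vector

token_ids = [7, 9, 7]  # token 7 appears at positions 0 and 2

# Without positions: both occurrences of token 7 get the identical vector.
token_only = [token_embedding[t] for t in token_ids]

# With positions: add the position vector element-wise.
with_position = [
    [t + p for t, p in zip(token_embedding[tok], position_embedding[pos])]
    for pos, tok in enumerate(token_ids)
]

print(token_only[0] == token_only[2])        # True
print(with_position[0] == with_position[2])  # False
```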

Finally, the complete data-processing pipeline looks like this:

![1716778990725](image/从零开始构建LLM/1716778990725.png)


In [51]:
vocab_size = 50257
output_dim = 256

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

max_length = 4
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=max_length, stride=max_length, shuffle=False)

data_iter = iter(dataloader)

inputs, targets = next(data_iter)
print(f">> Token IDs:\n {inputs}")
print(f">> Inputs shape: {inputs.shape}")

token_embeddings = token_embedding_layer(inputs)
print(f">> {token_embeddings.shape}")
# >> (8, 4, 256) -> 8: batch_size, 4: max_length, 256: output_dim

context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))
print(f">> pos embeddings's shape: {pos_embeddings.shape}")

input_embeddings = token_embeddings + pos_embeddings
print(f">> input embeddings's shape: {input_embeddings.shape}")

>> Token IDs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])
>> Inputs shape: torch.Size([8, 4])
>> torch.Size([8, 4, 256])
>> pos embeddings's shape: torch.Size([4, 256])
>> input embeddings's shape: torch.Size([8, 4, 256])


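# 3 Coding Attention Mechanisms

The chapter proceeds as follows:
1. A simple self-attention mechanism
2. The attention mechanism used in LLMs
3. Causal attention
4. Multi-head attention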

![1716780422961](image/从零开始构建LLM/1716780422961.png)

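## 3.1 The problem with modeling long sequences

The main problem is loss of context. In an RNN encoder-decoder, for example, the decoder cannot directly access earlier hidden states of the encoder. It relies solely on the current hidden state, which has to encapsulate all relevant information. This can lose context, especially in complex sentences where dependencies span long distances.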

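## 3.2 Capturing data dependencies with attention mechanisms

To address the RNN's difficulty with long sequences, researchers proposed the structure below, known as *Bahdanau attention*. It lets the decoder access the encoder's earlier states.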

![1718697183686](image/从零开始构建LLM/1718697183686.png)

Inspired by *Bahdanau attention*, researchers later proposed the original *Transformer* architecture.

![1716877433399](image/从零开始构建LLM/1716877433399.png)

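## 3.3 Attending to different parts of the input with self-attention

Self-attention is the cornerstone of the Transformer used in LLMs.
In self-attention, "self" refers to the mechanism's ability to compute attention weights by relating different positions within a single input sequence. It captures the relationships and dependencies between parts of the input itself, whereas traditional attention focuses on the relationship between two sequences.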

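### 3.3.1 A simple self-attention mechanism without trainable weights

The goal of self-attention is to compute, for each input element ${x^{(i)}}$, a context vector ${z^{(i)}}$ that combines information from all input elements. A context vector can be interpreted as an enriched embedding vector.<br/>
In the figure below, the input sentence is *Your journey starts with one step*. Focus on ${x^{(2)}}$ and ${z^{(2)}}$: ${z^{(2)}}$ contains information from every input from ${x^{(1)}}$ to ${x^{(T)}}$.
Context vectors play a crucial role in self-attention. Their purpose is to create an enriched representation of each element in an input sequence (such as a sentence) by incorporating information from all other elements in the sequence, as shown in the figure below.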

![1716879045635](image/从零开始构建LLM/1716879045635.png)

![1716881565809](image/从零开始构建LLM/1716881565809.png)
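
In symbols, this simplified, weight-free self-attention for the query ${x^{(2)}}$ can be summarized as follows. The names $\omega_{2i}$ (attention score) and $\alpha_{2i}$ (attention weight) are notation chosen here for illustration, and $T$ is the sequence length.

$$
\omega_{2i} = x^{(i)} \cdot x^{(2)}, \qquad
\alpha_{2i} = \frac{\exp(\omega_{2i})}{\sum_{j=1}^{T} \exp(\omega_{2j})}, \qquad
z^{(2)} = \sum_{i=1}^{T} \alpha_{2i}\, x^{(i)}
$$

The three expressions correspond to the three steps of this subsection: dot-product scores, softmax normalization, and the weighted sum that yields the context vector.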

In [52]:
inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

query = inputs[1]  # 2nd input token is the query

attn_scores_2 = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i, query) # dot product (transpose not necessary here since they are 1-dim vectors)

print(f">> {attn_scores_2}")

>> tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])


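> The operation above is a dot product: multiply two vectors element by element and sum the results. The larger the value, the more similar (more strongly related) the two vectors are.

The scores then need to be normalized.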

In [54]:
attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum()  # keep the raw scores for the softmax below
print(f">> normalized attn_weights for x^2: {attn_weights_2_tmp}")
print(f">> normalized attn_weights's sum for x^2: {attn_weights_2_tmp.sum()}")

>> normalized attn_weights for x^2: tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])
>> normalized attn_weights's sum for x^2: 1.0


> In practice, softmax is used more often for normalization, because it handles extreme values better and has more favorable gradient behavior.

In [58]:
def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

attn_weights_2_naive = softmax_naive(attn_scores_2)
print(f">> attn_weights_naive for x^2: {attn_weights_2_naive}")
print(f">> attn_weights_naive's sum for x^2: {attn_weights_2_naive.sum()}")
print()

attn_weights_2 = torch.softmax(attn_scores_2, dim=0)
print(f">> attn_weights for x^2: {attn_weights_2}")
print(f">> attn_weights's sum for x^2: {attn_weights_2.sum()}")

>> attn_weights_naive for x^2: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
>> attn_weights_naive's sum for x^2: 1.0

>> attn_weights for x^2: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
>> attn_weights's sum for x^2: 1.0


In [60]:
# Putting it together: the context vector for x^2 is the weighted sum of all inputs
query = inputs[1]
context_vec_2 = torch.zeros(query.shape)
for i, x_i in enumerate(inputs):
    context_vec_2 += attn_weights_2[i] * x_i
print(f">> context_vec: {context_vec_2}")

>> context_vec: tensor([0.4419, 0.6515, 0.5683])
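
The note above says that softmax copes better with extreme values. One reason, sketched below in plain Python, is that softmax can be computed in a numerically stable way by subtracting the maximum score first. The function `softmax_stable` is an illustrative assumption, not the actual implementation behind `torch.softmax`, although library implementations are generally written to avoid this kind of overflow.

```python
import math


def softmax_naive(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_stable(scores):
    """Subtract the maximum first; mathematically the result is unchanged."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


small = [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865]  # attention scores for x^2
print([round(w, 4) for w in softmax_stable(small)])

large = [1000.0, 1001.0, 1002.0]
try:
    softmax_naive(large)
except OverflowError as err:
    print("naive softmax failed:", err)
print([round(w, 4) for w in softmax_stable(large)])
```

For moderate scores both versions agree. For very large scores the naive version overflows, while the stable one still returns weights that sum to 1.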


### 3.3.2 Computing attention weights for all inputs

![1716945187106](image/从零开始构建LLM/1716945187106.png)

The computation follows the same steps as before, now applied to every input token:

![1716945198064](image/从零开始构建LLM/1716945198064.png)

In [69]:
# >> attention scores
# method 1
attn_scores = torch.empty(6, 6)
for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        attn_scores[i, j] = torch.dot(x_i, x_j)
print(f"attn scores: {attn_scores}")

# method 2
attn_scores = torch.matmul(inputs, inputs.T)
print(f"attn scores: {attn_scores}")

# >> attention weights: softmax over each row
attn_weights = torch.softmax(attn_scores, dim=1)
print(f"attn weights (softmax): {attn_weights}")


attn scores: tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])
attn scores: tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])
attn weights (softmax): tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
        [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
        [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
        [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
        [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
        [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])

In [None]:
attn_weight
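
The whole subsection can be sketched end to end in plain Python: scores, row-wise softmax, then one context vector per input token. This is a dependency-free illustration under the same toy inputs, not the torch code of these notes. The helpers `dot` and `softmax` are defined here for the example.

```python
import math

inputs = [
    [0.43, 0.15, 0.89],  # Your
    [0.55, 0.87, 0.66],  # journey
    [0.57, 0.85, 0.64],  # starts
    [0.22, 0.58, 0.33],  # with
    [0.77, 0.25, 0.10],  # one
    [0.05, 0.80, 0.55],  # step
]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


# Step 1: attention scores, one row per query token.
attn_scores = [[dot(q, k) for k in inputs] for q in inputs]

# Step 2: attention weights, softmax over each row so that every row sums to 1.
attn_weights = [softmax(row) for row in attn_scores]

# Step 3: context vectors, each a weighted sum of all input vectors.
context_vecs = [
    [sum(w * x[d] for w, x in zip(weights, inputs)) for d in range(len(inputs[0]))]
    for weights in attn_weights
]

for vec in context_vecs:
    print([round(v, 4) for v in vec])
```

In tensor form, step 3 corresponds to a single matrix product of the weight matrix with the inputs. The second row of the result is the context vector for "journey" computed in the previous subsection.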