## Corpus data

In [1]:
path = './data/shakespeare_input.txt'  # `path` avoids shadowing the built-in dir()
with open(path, 'r', encoding='utf-8') as f:
    text=f.read()


In [2]:
print(len(text))  # size of the whole corpus, in characters

1115394


In [3]:
#here are the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(f'vocabulary size:{vocab_size}')



 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
vocabulary size:65


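## Tokenization and train/validation split

### Tokenization
This uses the simplest possible tokenizer, a character-level tokenizer. It builds two dictionaries to store the mappings: the first maps each character to its index, and the second maps each index back to its character.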



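##### Further notes
When designing a tokenizer, there is a trade-off between the codebook size (the vocabulary) and the sequence length.
The same text can be encoded as a very long sequence of integers drawn from a very small vocabulary, or as a short sequence of integers drawn from a very large vocabulary.

1. A very small vocabulary and a very long integer sequence:

Each token covers only a small piece of text (a single character or byte, for example), so many tokens are needed to encode a passage. Every token ID falls in a small range, because the vocabulary has few entries.
This may suit data that is highly repetitive over a small alphabet, but the model then has to handle very long sequences.
2. A very large vocabulary and a short integer sequence:

Each token covers a larger piece of text (a subword or a whole word), so fewer tokens are needed to encode the same passage. Token IDs span a large range, because the vocabulary has many entries.
This may suit more varied text, because it can capture finer lexical detail, but memory and compute also have to be considered: a larger vocabulary increases the size and complexity of the model.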
--------------
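My own understanding

With a large vocabulary, in the extreme case every whole word (or slang term, or emoji) gets its own index, so a sentence is encoded as only a few tokens. This large-vocabulary, short-sequence approach is a natural fit when texts are short, and it helps the model pick up fine lexical distinctions. Social media text, for example, usually contains a lot of slang, emoji and personalized vocabulary. A larger vocabulary may be needed to capture these varied expressions, while memory and compute limits may still require capping the sequence length.

A small vocabulary with long sequences tends to be used for long texts with a high repetition rate. In bioinformatics, for example, gene sequences can be very long, but the "vocabulary" (the nucleotides) has only four entries (A, T, C, G). Here the sequence length is what matters, and the vocabulary is tiny.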
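
The trade-off can be checked on a toy nucleotide string. The sketch below is only an illustration: the string `seq` is made up, and splitting into non-overlapping 3-mers is an assumed stand-in for "a larger vocabulary", not something the notebook itself does.

```python
# Made-up DNA string, 24 nucleotides long (illustrative only).
seq = "ATGCGTACGTTAGCATGCCGTAAC"

# Option 1: one token per nucleotide -> tiny vocabulary, long sequence.
char_vocab = sorted(set(seq))
char_to_id = {ch: i for i, ch in enumerate(char_vocab)}
char_ids = [char_to_id[ch] for ch in seq]

# Option 2: one token per non-overlapping 3-mer -> larger vocabulary, shorter sequence.
k = 3
kmers = [seq[i:i + k] for i in range(0, len(seq), k)]
kmer_vocab = sorted(set(kmers))
kmer_to_id = {km: i for i, km in enumerate(kmer_vocab)}
kmer_ids = [kmer_to_id[km] for km in kmers]

print(f"per-nucleotide: vocab={len(char_vocab)}, length={len(char_ids)}")
print(f"per-3-mer:      vocab={len(kmer_vocab)}, length={len(kmer_ids)}")
```

On this string the per-nucleotide encoding has 4 vocabulary entries and 24 tokens, while the 3-mer encoding has 7 entries (out of 64 possible) and 8 tokens.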

--------------
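Q: What happens if an unsuitable tokenization strategy is used in a particular scenario?

Q: In practice, do conversational LLMs choose a tokenization strategy dynamically based on the length of the user's input?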

In [4]:
###Tokenizers
# create a mapping from characters to integers
str2idx = {ch: i for i, ch in enumerate(chars)}
idx2str = {i: ch for i, ch in enumerate(chars)}
encode = lambda s: [str2idx[c] for c in s]
decode = lambda l: ''.join(idx2str[c] for c in l)

In [5]:
print(encode('hii there'))
print(decode(encode('hii there')))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there
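
One consequence of building the mapping from the corpus itself is that `encode` only works for characters the corpus contains. The sketch below rebuilds the same two dictionaries on a tiny made-up string (an assumption for illustration, not the Shakespeare vocabulary) to show the round trip and the `KeyError` raised for an unseen character.

```python
# Tiny stand-in vocabulary built from a made-up string (illustrative only).
chars = sorted(set("hi there"))
str2idx = {ch: i for i, ch in enumerate(chars)}
idx2str = {i: ch for i, ch in enumerate(chars)}
encode = lambda s: [str2idx[c] for c in s]
decode = lambda ids: ''.join(idx2str[i] for i in ids)

# Encoding then decoding gives back the original string.
round_trip = decode(encode("hi there"))

# A character that never appeared when the vocabulary was built cannot be encoded.
missing = None
try:
    encode("hi@there")
except KeyError as err:
    missing = err.args[0]

print(round_trip)
print("not in vocabulary:", repr(missing))
```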


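In practice, production systems use more elaborate tokenizers. Google, for example, uses SentencePiece, a subword tokenizer: it encodes neither whole words nor single characters, but works at the level of subword units.

OpenAI uses a library called tiktoken, which implements a byte pair encoding (BPE) tokenizer.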
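
To give a feel for what byte pair encoding does, here is a minimal sketch of a single merge step in plain Python. It is a toy under stated assumptions, not how tiktoken or SentencePiece are implemented: the helper names `most_frequent_pair` and `merge_pair` and the sample string are hypothetical.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the most common adjacent pair of token ids."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` (left to right) with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aaabdaaabac".encode("utf-8"))  # start from raw bytes (ids 0..255)
pair = most_frequent_pair(ids)             # here: the byte pair for "aa"
merged = merge_pair(ids, pair, 256)        # 256 is the first id outside the byte range
print(len(ids), "->", len(merged))
```

Each merge adds one entry to the vocabulary and shortens the sequence, which is the vocabulary-size versus sequence-length trade-off in miniature. A full BPE trainer would repeat this step until a target vocabulary size is reached.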

In [6]:
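# encode the entire dataset
import torch
# torch.tensor builds a tensor from the list of token ids; dtype=torch.long stores them as 64-bit integers
data = torch.tensor(encode(text), dtype=torch.long)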
print(data.shape, data.dtype)
print(data[:1000])


torch.Size([1115394]) torch.int64
tensor([ 0, 18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43,
        44, 53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52,
        63,  1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,
         1, 57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39,
        49,  6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15,
        47, 58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50,
        50,  1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1,
        58, 53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51,
        47, 57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43,
        42,  8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57,
        58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1,
        63, 53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56,
      

### Train/validation split

In [7]:
n = int(0.9*len(data))
train_data = data[:n]  # training set: the first 90% of the tokens
val_data = data[n:]  # validation set: the remaining 10%

In [8]:
print(len(train_data))
print(len(val_data))

1003854
111540


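## Data loader: batches of chunks

When training a Transformer, we do not feed the whole corpus to the model at once.

Instead, we randomly sample chunks from the training set and train on those. The length of a chunk is called block_size.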

In [9]:
# suppose block_size is 8; the first block is then
block_size=8
train_data[:block_size+1]

tensor([ 0, 18, 47, 56, 57, 58,  1, 15, 47])

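In the code above, the slice takes block_size + 1 elements because a chunk of 9 characters actually contains 8 training examples: each prefix of the first 8 characters is an input context, and its target is the character that follows it. The examples are listed below.

Why does Karpathy speak here of the "time dimension" of the tensors fed to the Transformer? As far as I can tell, "time" is simply the position within the sequence: the tokens are ordered, and each target is predicted from the positions that come before it.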

In [10]:
x = train_data[:block_size]  # the first block_size elements (inputs)
y = train_data[1:block_size+1]  # the same window shifted by one, so y[t] is the token that follows x[:t+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([0]) the target: 18
when input is tensor([ 0, 18]) the target: 47
when input is tensor([ 0, 18, 47]) the target: 56
when input is tensor([ 0, 18, 47, 56]) the target: 57
when input is tensor([ 0, 18, 47, 56, 57]) the target: 58
when input is tensor([ 0, 18, 47, 56, 57, 58]) the target: 1
when input is tensor([ 0, 18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([ 0, 18, 47, 56, 57, 58,  1, 15]) the target: 47


##### Batch size
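Each time we feed data into the Transformer, we feed a batch of several chunks of text stacked into a single tensor. This is done for efficiency, so that the GPU can process the chunks in parallel. The chunks are processed completely independently and do not communicate with each other.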

Every time we're going to feed them into a transformer, we're going to have many batches of multiple chunks of text that are all stacked up in a single tensor.

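So each batch is a single tensor of shape (batch_size, block_size) that stacks batch_size chunks, and training iterates over many such batches.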

In [13]:
torch.manual_seed(1337)
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for prediction?

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data  # pick the training or the validation data
    ix = torch.randint(len(data) - block_size, (batch_size,))  # random offsets select different chunks
    # offsets: batch_size integers drawn uniformly from [0, len(data) - block_size)
    x = torch.stack([data[i:i+block_size] for i in ix])  # stack the sampled sequences into a batch
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('input:')
print(xb.shape)
print(xb)
print('target:')
print(yb.shape)
print(yb)

print('---------------')

for b in range(batch_size):#batch dimension
    for t in range(block_size):#time dimension
        context=xb[b,:t+1]
        target=yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")

input:
torch.Size([4, 8])
tensor([[ 0, 24, 43, 58,  5, 57,  1, 46],
        [ 1, 44, 53, 56,  1, 58, 46, 39],
        [43, 52, 58,  1, 58, 46, 39, 58],
        [27, 25, 17, 27, 10,  0, 21,  1]])
target:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
---------------
when input is [0] the target: 24
when input is [0, 24] the target: 43
when input is [0, 24, 43] the target: 58
when input is [0, 24, 43, 58] the target: 5
when input is [0, 24, 43, 58, 5] the target: 57
when input is [0, 24, 43, 58, 5, 57] the target: 1
when input is [0, 24, 43, 58, 5, 57, 1] the target: 46
when input is [0, 24, 43, 58, 5, 57, 1, 46] the target: 43
when input is [1] the target: 44
when input is [1, 44] the target: 53
when input is [1, 44, 53] the target: 56
when input is [1, 44, 53, 56] the target: 1
when input is [1, 44, 53, 56, 1] the target: 58
when input is [1, 44, 

## The simplest baseline: bigram language model, loss, generation


In [27]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)
class BigramLanguageModel(nn.Module):
    
    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        # token embedding table
        # nn.Embedding is a very thin wrapper around a tensor of shape (vocab_size, vocab_size)
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets):
        # forward is not called directly: m(xb, yb) goes through nn.Module.__call__, which invokes forward
        # idx and targets are both (B,T) tensor of integers
        # logits is (B,T,C); here B (batch) = 4, T (time) = 8, C (channel) = vocab_size = 65
        logits = self.token_embedding_table(idx)
        return logits

m = BigramLanguageModel(vocab_size)
out = m(xb, yb)
out

tensor([[[ 0.1808, -0.0700, -0.3596,  ...,  1.6097, -0.4032, -0.8345],
         [-1.5101, -0.0948,  1.0927,  ..., -0.6126, -0.6597,  0.7624],
         [ 0.3323, -0.0872, -0.7470,  ..., -0.6716, -0.9572, -0.9594],
         ...,
         [-0.5201,  0.2831,  1.0847,  ..., -0.0198,  0.7959,  1.6014],
         [ 0.5978, -0.0514, -0.0646,  ..., -1.4649, -2.0555,  1.8275],
         [ 1.0901,  0.2170, -2.9996,  ..., -0.5472, -0.8017,  0.7761]],

        [[ 0.5978, -0.0514, -0.0646,  ..., -1.4649, -2.0555,  1.8275],
         [ 1.0541,  1.5018, -0.5266,  ...,  1.8574,  1.5249,  1.3035],
         [-0.1324, -0.5489,  0.1024,  ..., -0.8599, -1.6050, -0.6985],
         ...,
         [ 0.2475, -0.6349, -1.2909,  ...,  1.3064, -0.2256, -1.8305],
         [ 1.0901,  0.2170, -2.9996,  ..., -0.5472, -0.8017,  0.7761],
         [ 1.1513,  1.0539,  3.4105,  ..., -0.5686,  0.9079, -0.1701]],

        [[ 0.3323, -0.0872, -0.7470,  ..., -0.6716, -0.9572, -0.9594],
         [-0.2103,  0.4481,  1.2381,  ...,  1

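
Two things in the cell above are easy to miss: the model is called as `m(xb, yb)` without ever naming `forward`, and the embedding is just a row lookup that turns a (B,T) grid of ids into a (B,T,C) grid of logits. The pure-Python sketch below imitates both ideas. `TinyModule` and `TinyBigram` are hypothetical stand-ins, the table holds dummy numbers instead of learned weights, and the real `nn.Module.__call__` does more than this (it also runs hooks, for example).

```python
class TinyModule:
    """Stand-in for nn.Module: calling the object dispatches to .forward()."""
    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)

class TinyBigram(TinyModule):
    def __init__(self, vocab_size):
        # One row of dummy "logits" per token id: a vocab_size x vocab_size table.
        self.table = [[float(r * vocab_size + c) for c in range(vocab_size)]
                      for r in range(vocab_size)]

    def forward(self, idx):
        # idx is a nested list of shape (B, T); the result has shape (B, T, C).
        return [[self.table[token] for token in row] for row in idx]

m = TinyBigram(vocab_size=3)
out = m([[0, 2], [1, 1]])  # m(...) runs __call__, which runs forward(...)
print(len(out), len(out[0]), len(out[0][0]))  # B, T, C
print(out[0][1])  # the table row for token id 2
```

Because each position looks up a row using only its own token id, the prediction for the next token depends on the current token alone, which is what makes this a bigram model.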
