<a href="https://colab.research.google.com/github/shake/colab-Llama-2-ipynb/blob/main/huggingface/hg_04_understanding_model_files.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

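# Hugging Face for Beginners

## 04 Understanding Model Files

When we load a model with an `AutoClass` or `pipeline()`, we see a set of different files being downloaded.

Ever wondered what these files are and what each one is for? In this notebook we use [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) as the example and walk through the files that make up a model. Note that the code cells load the base checkpoint `meta-llama/Llama-2-7b-hf`, which is organized into the same kinds of files.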

[Understanding the Llama2 Tokenizer: Working with the Tokenizer locally using Transformers](https://medium.com/@vyperius117/understanding-the-llama2-tokenizer-working-with-the-tokenizer-locally-using-transformers-2e0f9e69d786)

### A typical model loading process

In [None]:
!pip install -q git+https://github.com/huggingface/transformers torch sentencepiece

In [2]:
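from google.colab import userdata
from transformers import AutoModel, AutoTokenizer

# Read the Hugging Face access token from the Colab secret named 'huggingface'
# (Llama 2 is a gated model, so downloads must be authenticated)
hf_token = userdata.get('huggingface')

# Download the config and weight shards (or reuse the local cache), then build the model
model = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf", trust_remote_code=True, token=hf_token)

# Download the tokenizer files and build the tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", trust_remote_code=True, token=hf_token)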

config.json:   0%|          | 0.00/609 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

tokenizer_config.json:   0%|          | 0.00/776 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

In [8]:
# Constants
stdout_padding = "#" * 20

# Confirm vocabulary size
print(f"{stdout_padding} Llama2 Tokenizer Details {stdout_padding}\n")
print(f"Llama2 tokenizer overview: {tokenizer}")
print(f"Llama2 Vocabulary Size: {len(tokenizer.get_vocab().keys())}\n")
print(f"{stdout_padding} End of Llama2 Tokenizer Details {stdout_padding}\n")


#################### Llama2 Tokenizer Details ####################

Llama2 tokenizer overview: LlamaTokenizerFast(name_or_path='meta-llama/Llama-2-7b-hf', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
Llama2 Vocabulary Size: 32000

#################### End of Llama2 Tokenizer Details ####################



In [9]:
# Verify token IDs for Llama2 special tokens
print(f"{stdout_padding} Llama2 Special Tokens {stdout_padding}\n")

UNK = "<unk>" # Unknown token
BOS, EOS = "<s>", "</s>" # Beginning-of-sequence and end-of-sequence tokens

special_tokens = [UNK, BOS, EOS]

for token in special_tokens:
    print(f'Token ID for the special token {token}: {tokenizer.get_vocab()[token]}')
    print(f'Encoded {token} becomes: {tokenizer.encode(token)}\n')

print(f"{stdout_padding} End of Llama2 Special Tokens {stdout_padding}\n")

#################### Llama2 Special Tokens ####################

Token ID for the special token <unk>: 0
Encoded <unk> becomes: [1, 0]

Token ID for the special token <s>: 1
Encoded <s> becomes: [1, 1]

Token ID for the special token </s>: 2
Encoded </s> becomes: [1, 2]

#################### End of Llama2 Special Tokens ####################



In [10]:
# Verify token IDs for Llama2 prompt symbols
print(f"{stdout_padding} Llama2 Prompt Symbols {stdout_padding}\n")

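B_INST, E_INST = "[INST]", "[/INST]" # Begin-of-instruction and end-of-instruction symbols
# NOTE: Meta's reference code defines E_SYS as "\n<</SYS>>\n\n" (with a slash).
# The string used here omits the slash, and the recorded output reflects it as written.
B_SYS, E_SYS = "<<SYS>>\n", "\n<<SYS>>\n\n" # Begin-of-system-message and end-of-system-message symbols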

prompt_symbols = [B_INST, E_INST, B_SYS, E_SYS]

for symbol in prompt_symbols:
    encoded_symbol = tokenizer.encode(symbol)
    print(f'Encoded {repr(symbol)} becomes: {encoded_symbol}')

    for token in encoded_symbol:
        print(f"\tToken ID {token} --> {repr(tokenizer.decode(token))}")

print(f"\n{stdout_padding} End of Llama2 Prompt Symbols {stdout_padding}\n")

#################### Llama2 Prompt Symbols ####################

Encoded '[INST]' becomes: [1, 518, 25580, 29962]
	Token ID 1 --> '<s>'
	Token ID 518 --> '['
	Token ID 25580 --> 'INST'
	Token ID 29962 --> ']'
Encoded '[/INST]' becomes: [1, 518, 29914, 25580, 29962]
	Token ID 1 --> '<s>'
	Token ID 518 --> '['
	Token ID 29914 --> '/'
	Token ID 25580 --> 'INST'
	Token ID 29962 --> ']'
Encoded '<<SYS>>\n' becomes: [1, 3532, 14816, 29903, 6778, 13]
	Token ID 1 --> '<s>'
	Token ID 3532 --> '<<'
	Token ID 14816 --> 'SY'
	Token ID 29903 --> 'S'
	Token ID 6778 --> '>>'
	Token ID 13 --> '\n'
Encoded '\n<<SYS>>\n\n' becomes: [1, 29871, 13, 9314, 14816, 29903, 6778, 13, 13]
	Token ID 1 --> '<s>'
	Token ID 29871 --> ''
	Token ID 13 --> '\n'
	Token ID 9314 --> '<<'
	Token ID 14816 --> 'SY'
	Token ID 29903 --> 'S'
	Token ID 6778 --> '>>'
	Token ID 13 --> '\n'
	Token ID 13 --> '\n'

#################### End of Llama2 Prompt Symbols ####################



In [11]:
# Test tokenizer on a sentence
print(f"{stdout_padding} Llama2 Tokenizer Sentence Example {stdout_padding}\n")
sentence = "RHEL subscription manager let's you manage packages on RedHat."
encoded_output = tokenizer.encode(sentence)
print(f"Original sentence: {sentence}")
print(f"Encoded sentence: {encoded_output}")

# Verify what each token ID correlates to
for token in encoded_output:
    print(f"Token ID {token} --> {tokenizer.decode(token)}")

print(f"{stdout_padding} End of Llama2 Tokenizer Sentence Example {stdout_padding}\n")

#################### Llama2 Tokenizer Sentence Example ####################

Original sentence: RHEL subscription manager let's you manage packages on RedHat.
Encoded sentence: [1, 390, 29950, 6670, 25691, 8455, 1235, 29915, 29879, 366, 10933, 9741, 373, 4367, 29950, 271, 29889]
Token ID 1 --> <s>
Token ID 390 --> R
Token ID 29950 --> H
Token ID 6670 --> EL
Token ID 25691 --> subscription
Token ID 8455 --> manager
Token ID 1235 --> let
Token ID 29915 --> '
Token ID 29879 --> s
Token ID 366 --> you
Token ID 10933 --> manage
Token ID 9741 --> packages
Token ID 373 --> on
Token ID 4367 --> Red
Token ID 29950 --> H
Token ID 271 --> at
Token ID 29889 --> .
#################### End of Llama2 Tokenizer Sentence Example ####################



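#### Model files in detail

The model repository contains the following files:

| File | Description |
| --- | --- |
| config.json | Main configuration of the model architecture, such as the model type, hidden size, number of layers and attention heads, and vocabulary size. |
| generation_config.json | Default settings for text generation. |
| model-00001-of-00002.safetensors | Shard 1 of the model weights in safetensors format (see the safetensors section below). |
| model-00002-of-00002.safetensors | Shard 2 of the model weights in safetensors format. |
| model.safetensors.index.json | JSON index for the safetensors weights, recording which shard stores each parameter. |
| pytorch_model-00001-of-00002.bin | Shard 1 of the PyTorch model weights, serialized with pickle. |
| pytorch_model-00002-of-00002.bin | Shard 2 of the PyTorch model weights, serialized with pickle. |
| pytorch_model.bin.index.json | JSON index for the pickle-serialized PyTorch weights, recording which shard stores each parameter. |
| special_tokens_map.json | Maps the tokenizer's special-token roles (e.g. `bos_token`, `eos_token`, `unk_token`) to the token strings they use. |
| tokenizer.json | The complete fast tokenizer definition: vocabulary, merge rules, and the normalization and pre/post-processing steps. |
| tokenizer.model | The trained SentencePiece tokenizer model; a binary file that is not human-readable. |
| tokenizer_config.json | Settings applied when the tokenizer is used, such as the tokenizer class and the maximum sequence length. |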
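
To make the `config.json` entry more concrete, here is a minimal sketch that estimates the model size from a few architecture values. The four values are typed in by hand and are assumed to match the published Llama 2 7B configuration; they are not read from the downloaded file. The parameter-count formula likewise assumes a Llama-style decoder with no bias terms, standard multi-head attention, and an output head that is not tied to the embeddings.

```python
import json

# Assumed excerpt of config.json for Llama 2 7B (typed by hand, not downloaded).
config = json.loads("""
{
  "hidden_size": 4096,
  "intermediate_size": 11008,
  "num_hidden_layers": 32,
  "vocab_size": 32000
}
""")


def count_parameters(cfg):
    """Count weights in a Llama-style decoder (no biases, untied output head)."""
    h, ffn = cfg["hidden_size"], cfg["intermediate_size"]
    embeddings = cfg["vocab_size"] * h   # input token embeddings
    attention = 4 * h * h                # q, k, v and output projections
    mlp = 3 * h * ffn                    # gate, up and down projections
    norms = 2 * h                        # two RMSNorm weight vectors per layer
    per_layer = attention + mlp + norms
    final_norm = h                       # RMSNorm applied after the last layer
    lm_head = cfg["vocab_size"] * h      # output projection to the vocabulary
    return embeddings + cfg["num_hidden_layers"] * per_layer + final_norm + lm_head


total_params = count_parameters(config)
size_gb = total_params * 2 / 1e9  # assuming 2 bytes per weight (16-bit floats)
print(f"{total_params:,} parameters, about {size_gb:.2f} GB")
```

Under these assumptions the estimate comes to about 13.48 GB, in line with the two safetensors shards in the download log above (9.98 GB + 3.50 GB).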

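A large model is usually split into several shards, saved with names such as `model-00001-of-00002.safetensors`. The index file (`model.safetensors.index.json` for safetensors weights, `pytorch_model.bin.index.json` for pickle-serialized weights) tells the loader where every weight lives. It mainly contains:
- `metadata`: summary information about the checkpoint, such as `total_size`, the combined size of all weights in bytes
- `weight_map`: a mapping from each parameter name to the shard file that stores it, which is what lets the loader reassemble the full model from its shards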
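
As a rough illustration, the sketch below reads an index file and groups parameter names by shard. The sample content is hand-written and heavily shortened, with illustrative values only; a real index lists every parameter of the model under the same two top-level keys, `metadata` and `weight_map`. The helper name `summarize_index` is made up for this example.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path

# Hand-written, heavily shortened sample (illustrative values, not the real file).
sample_index = {
    "metadata": {"total_size": 13476831232},
    "weight_map": {
        "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
        "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
        "model.norm.weight": "model-00002-of-00002.safetensors",
        "lm_head.weight": "model-00002-of-00002.safetensors",
    },
}


def summarize_index(index_path):
    """Return the total size and a {shard file: [parameter names]} mapping."""
    index = json.loads(Path(index_path).read_text(encoding="utf-8"))
    shards = defaultdict(list)
    for param_name, shard_file in index["weight_map"].items():
        shards[shard_file].append(param_name)
    return index["metadata"]["total_size"], dict(shards)


# Write the sample to a temporary file so the helper reads it like a real index.
with tempfile.TemporaryDirectory() as tmp_dir:
    index_path = Path(tmp_dir) / "model.safetensors.index.json"
    index_path.write_text(json.dumps(sample_index), encoding="utf-8")
    total_size, shards = summarize_index(index_path)

print(f"total size: {total_size / 1e9:.2f} GB in {len(shards)} shards")
for shard_file, params in sorted(shards.items()):
    print(f"{shard_file}: {len(params)} parameters")
```

To inspect a real checkpoint, pass the path of the downloaded index file to `summarize_index` instead of the temporary sample.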

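`special_tokens_map.json` maps each special-token role of the `Tokenizer` to the token string it uses. The numeric IDs are not stored in this file; they come from the tokenizer's vocabulary.

Some common special tokens are:
1. unk_token - stands in for out-of-vocabulary words
2. sep_token - separates sentences
3. pad_token - pads sequences to equal length
4. cls_token - classification token used in classification tasks
5. mask_token - mask token used in masked language modeling

Llama 2 defines only `bos_token` (`<s>`), `eos_token` (`</s>`), and `unk_token` (`<unk>`), as the tokenizer overview printed earlier shows.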
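
The sketch below shows how the role-to-token mapping is typically resolved to IDs. The JSON text is a hand-written sample in the style of a Llama 2 `special_tokens_map.json`, and `toy_vocab` is a three-entry stand-in for the real vocabulary that holds only the IDs printed earlier in this notebook. Both are assumptions for illustration, as is the helper name `token_text`.

```python
import json

# Hand-written sample in the style of special_tokens_map.json (illustrative).
special_tokens_map_text = """
{
  "bos_token": {"content": "<s>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false},
  "eos_token": {"content": "</s>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false},
  "unk_token": {"content": "<unk>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false}
}
"""

# Toy vocabulary: only the three special-token IDs shown earlier in this notebook.
toy_vocab = {"<unk>": 0, "<s>": 1, "</s>": 2}


def token_text(entry):
    """An entry is either a plain string or an object with a "content" field."""
    return entry if isinstance(entry, str) else entry["content"]


special_tokens_map = json.loads(special_tokens_map_text)

# The file supplies the token string; the vocabulary supplies the numeric ID.
resolved = {
    role: (token_text(entry), toy_vocab[token_text(entry)])
    for role, entry in special_tokens_map.items()
}

for role, (text, token_id) in resolved.items():
    print(f"{role}: {text!r} -> id {token_id}")
```

With a loaded tokenizer, the same lookup is what `tokenizer.get_vocab()[token]` does in the cells above.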


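#### safetensors files

safetensors is a file format for storing and loading tensors safely and quickly. `PyTorch` model weights are usually serialized into a `.bin` file with Python's `pickle` utility. However, `pickle` is not safe: a pickled file can contain malicious code that runs when the file is loaded. `safetensors` is a safe alternative to `pickle` and is well suited to sharing model weights.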
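
To see why loading a pickle file amounts to running code, consider the following minimal sketch, which uses only the standard library. The `Payload` class is a made-up example, and the expression it evaluates is deliberately harmless; an attacker could put any callable and arguments in its place.

```python
import pickle


class Payload:
    """Made-up example: an object that controls what happens when it is unpickled."""

    def __reduce__(self):
        # pickle stores "call eval with this argument" instead of the object's data.
        # The expression here is harmless arithmetic, but it could be anything.
        return (eval, ("sum(range(10))",))


blob = pickle.dumps(Payload())

# Loading the bytes calls eval(...) and returns its result, not a Payload instance.
restored = pickle.loads(blob)
print(type(restored).__name__, restored)
```

The loaded value is whatever the embedded call returns, which shows that `pickle.loads` executes instructions chosen by whoever created the file. A safetensors file, by contrast, holds only tensor metadata and raw tensor bytes, so reading it does not involve running code from the file.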

For details, see [Using Safetensors](https://huggingface.co/docs/diffusers/main/en/using-diffusers/using_safetensors)