
# How to train a new language model from scratch using Transformers and Tokenizers

Copyright 2021, Denis Rothman. Denis Rothman adapted a Hugging Face reference notebook to pretrain a transformer model. The next steps would be to work on building a larger dataset and testing several transformer models.

Working through this notebook is recommended. GPT-3 engines are now available through an API that can outperform many trained transformer models. However, to show a transformer what to do with input datasets, it is essential to understand how transformers are trained.

The model in this notebook is a transformer named ***KantaiBERT***. ***KantaiBERT*** is trained as a RoBERTa transformer with a DistilBERT-sized architecture. The dataset was compiled from three books by Immanuel Kant downloaded from [Project Gutenberg](https://www.gutenberg.org/).


***KantaiBERT*** is pretrained as a small model of about 84 million parameters with the same number of layers and heads as DistilBERT: 6 layers, a hidden size of 768, and 12 attention heads. ***KantaiBERT*** is then fine-tuned for a downstream masked language modeling task.

### The Hugging Face original Reference and notes:

Notebook edition of the reference blog post ([link to the original](https://huggingface.co/blog/how-to-train)).




> From RoBERTa: the **byte-pair** encoding (BPE) tokenization method



In [1]:
from IPython.display import Image     #This is used for rendering images in the notebook

In [2]:
#@title Step 1: Loading the Dataset
#1. Load kant.txt using the Colab file manager
#2. Or download the file from GitHub
!curl -L https://raw.githubusercontent.com/Denis2054/Transformers-for-NLP-2nd-Edition/master/Chapter04/kant.txt --output "kant.txt"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 10.7M  100 10.7M    0     0  16.6M      0 --:--:-- --:--:-- --:--:-- 16.6M


In [29]:
#@title Step 2: APRIL 2023 UPDATE: Installing Hugging Face Transformers
# We won't need TensorFlow here
!pip uninstall -y tensorflow
# Install `transformers` from source (development branch)
!pip install git+https://github.com/huggingface/transformers
!pip list | grep -E 'transformers|tokenizers'
# Versions reported by the run below: transformers 4.41.0.dev0, tokenizers 0.19.1

Found existing installation: tensorflow 2.15.0
Uninstalling tensorflow-2.15.0:
  Successfully uninstalled tensorflow-2.15.0
Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-xksb46u8
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-xksb46u8
  Resolved https://github.com/huggingface/transformers to commit 20081c743ee2ce31d178f2182c7466c3313adcd2
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ... done
  Created wheel for transformers: filename=transformers-4.41.0.dev0-py3-none-any.whl size=9040222 sha256=37cacd21f39b16530fab952623a71720d0f0b01440fd32b063c22435e65b12a3
  Stored in directory: /tmp/pip-e

tokenizers                       0.19.1
transformers                     4.41.0.dev0


April 2023 update, from Hugging Face issue 22816:

https://github.com/huggingface/transformers/issues/22816

"The PartialState import was added as a dependency on the transformers development branch yesterday. PartialState was added in the 0.17.0 release in accelerate, and so for the development branch of transformers, accelerate >= 0.17.0 is required.

Downgrading the transformers version removes the code which is importing PartialState."

Denis Rothman: The following cell installs the latest version of Hugging Face transformers without downgrading it.

To adapt to the Hugging Face upgrade, a GPU accelerator was activated using Google Colab Pro with the following NVIDIA GPU:
GPU Name: NVIDIA A100-SXM4-40GB

In [27]:
!pip install transformers



In [2]:
#@title Step 3: Training a Tokenizer
%%time
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

paths = [str(x) for x in Path(".").glob("**/*.txt")]
# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

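# Customize training: dataset paths, vocabulary size, minimum frequency threshold, list of special tokens
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",  # start of sequence
    "<pad>", # padding
    "</s>", # end of sequence
    "<unk>", # unknown token
    "<mask>", # mask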
])

CPU times: user 13.1 s, sys: 3.45 s, total: 16.6 s
Wall time: 1.46 s
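To make the training step above more concrete, here is a minimal pure-Python sketch of the byte-pair encoding idea: repeatedly fuse the most frequent adjacent pair of symbols until a merge budget is used up or no pair reaches `min_frequency`. The function name `train_bpe` and the toy word list are hypothetical. The real `ByteLevelBPETokenizer` works on bytes rather than characters and is far more optimized, so treat this only as an illustration of the principle.

```python
from collections import Counter

def train_bpe(words, num_merges, min_frequency=2):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each word starts as a tuple of single characters; identical words are counted.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < min_frequency:
            break  # no remaining pair is frequent enough to merge
        merges.append(best)
        # Rewrite the corpus with the best pair fused into one symbol.
        merged_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_corpus[tuple(out)] += freq
        corpus = merged_corpus
    return merges

toy_merges = train_bpe(["reason", "reason", "treason", "season"], num_merges=3)
print(toy_merges)
```

In this sketch, each learned merge corresponds to one line of a `merges.txt`-style file, and every symbol that survives would receive an index in a `vocab.json`-style mapping.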


In [3]:
#@title Step 4: Saving the files to disk (vocab.json: token-to-index map; merges.txt: learned merge rules)
import os
token_dir = 'KantaiBERT'  # relative path, matching "./KantaiBERT" used in the later steps
os.makedirs(token_dir, exist_ok=True)
tokenizer.save_model(token_dir)

['KantaiBERT/vocab.json', 'KantaiBERT/merges.txt']

In [4]:
#@title Step 5: Loading the Trained Tokenizer Files
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

tokenizer = ByteLevelBPETokenizer(
    "./KantaiBERT/vocab.json",
    "./KantaiBERT/merges.txt",
)

In [5]:
tokenizer.encode("The Critique of Pure Reason.").tokens

['The', 'ĠCritique', 'Ġof', 'ĠPure', 'ĠReason', '.']
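The `Ġ` prefix in the tokens above is how byte-level BPE marks a token that begins with a space. As a small illustration (an ASCII-only sketch, with the hypothetical helper name `detokenize`), the original sentence can be rebuilt by concatenating the tokens and mapping the marker back to a space:

```python
# Tokens copied from the output above.
tokens = ['The', 'ĠCritique', 'Ġof', 'ĠPure', 'ĠReason', '.']

def detokenize(tokens):
    """Rebuild text by mapping the 'Ġ' marker back to a space (ASCII-only sketch)."""
    return "".join(tokens).replace("Ġ", " ")

print(detokenize(tokens))  # The Critique of Pure Reason.
```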

In [6]:
tokenizer.encode("The Critique of Pure Reason.")

Encoding(num_tokens=6, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [7]:
# Post-processor of the tokenizer
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")), # special end-of-sentence token
    ("<s>", tokenizer.token_to_id("<s>")), # special start-of-sentence token
)
tokenizer.enable_truncation(max_length=512) # truncate sequences to 512 tokens

In [8]:
tokenizer.encode("The Critique of Pure Reason.").tokens

['<s>', 'The', 'ĠCritique', 'Ġof', 'ĠPure', 'ĠReason', '.', '</s>']
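The effect of the post-processor and truncation settings can be approximated in a few lines of plain Python. This is only a sketch with a hypothetical helper, `post_process`, and it assumes that the 512-token budget includes the two special tokens:

```python
def post_process(tokens, max_length=512, bos="<s>", eos="</s>"):
    """Wrap a token list in start/end markers, keeping the total within max_length."""
    room = max_length - 2          # assumption: two slots are reserved for <s> and </s>
    return [bos] + tokens[:room] + [eos]

wrapped = post_process(['The', 'ĠCritique', 'Ġof', 'ĠPure', 'ĠReason', '.'])
print(wrapped)
```

For the six-token sentence this yields the same eight tokens as the encoder output above, while a very long input would be cut so that the wrapped sequence stays at 512 tokens.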

In [9]:
tokenizer.encode("The Critique of Pure Reason.")

Encoding(num_tokens=8, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [10]:
#@title Step 6: Checking Resource Constraints: GPU and NVIDIA
!nvidia-smi

Fri Apr 26 14:30:02 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA L4                      Off | 00000000:00:03.0 Off |                    0 |
| N/A   46C    P8              17W /  72W |      4MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [11]:
#@title Checking that PyTorch Sees CUDA
import torch
torch.cuda.is_available()

True

In [12]:
#@title Step 7: Defining the configuration of the Model
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1, # size of the token-type vocabulary used to tell input segments apart; 1 means a single type
)

In [13]:
print(config)

RobertaConfig {
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.40.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}



In [14]:
#@title Step 8: Re-creating the Tokenizer in Transformers
from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained("./KantaiBERT", model_max_length=512)  # model_max_length sets the tokenizer's length limit

In [15]:
#@title Step 9: Initializing a Model From Scratch
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config)
print(model)

RobertaForMaskedLM(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(52000, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-5): 6 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): La

### Exploring the parameters

In [16]:
print(model.num_parameters())
# => 83,504,416 parameters in this run

83504416
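The total can be cross-checked by hand from the configuration values printed earlier. The sketch below is plain arithmetic, not a call into the library. It assumes that the output matrix of the masked-LM head is tied to the word embeddings, so it is not counted a second time, and that the model has no pooling layer.

```python
# Values taken from the printed RobertaConfig above.
vocab, hidden, positions, types = 52_000, 768, 514, 1
layers, intermediate = 6, 3_072

def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # scale and shift vectors

embeddings = (vocab + positions + types) * hidden + layer_norm
per_layer = (
    4 * linear(hidden, hidden)        # query, key, value, attention output
    + linear(hidden, intermediate)    # feed-forward expansion
    + linear(intermediate, hidden)    # feed-forward projection
    + 2 * layer_norm                  # one after attention, one after feed-forward
)
# Assumption: the LM head's output matrix is tied to the word embeddings,
# so the head only adds a dense layer, a LayerNorm, and a vocabulary-sized bias.
lm_head = linear(hidden, hidden) + layer_norm + vocab

total = embeddings + layers * per_layer + lm_head
print(f"{total:,}")  # 83,504,416
```

Roughly half of the parameters (about 40 million) sit in the embeddings, which is why the vocabulary size has such a large effect on the size of a small model.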


In [17]:
#@title Exploring the Parameters
LP=list(model.parameters())
lp=len(LP)
print(lp)


106


In [18]:
for p in range(0,lp):
  print(LP[p])

Parameter containing:
tensor([[ 0.0050, -0.0018, -0.0128,  ..., -0.0139,  0.0137,  0.0179],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0106,  0.0125,  0.0337,  ..., -0.0135, -0.0082,  0.0052],
        ...,
        [ 0.0188, -0.0017, -0.0140,  ...,  0.0155,  0.0166, -0.0214],
        [ 0.0250, -0.0276,  0.0050,  ..., -0.0153,  0.0285,  0.0006],
        [-0.0226, -0.0372, -0.0173,  ...,  0.0065, -0.0243,  0.0310]],
       requires_grad=True)
Parameter containing:
tensor([[ 0.0081,  0.0019, -0.0260,  ..., -0.0319, -0.0005, -0.0046],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0226, -0.0090, -0.0045,  ...,  0.0003, -0.0140,  0.0131],
        ...,
        [ 0.0127,  0.0157, -0.0047,  ...,  0.0157,  0.0155, -0.0212],
        [-0.0090, -0.0595, -0.0092,  ...,  0.0109, -0.0339,  0.0080],
        [ 0.0090,  0.0019,  0.0007,  ..., -0.0180, -0.0469,  0.0060]],
       requires_grad=True)
Parameter containing:
tensor([[ 0.

In [19]:
#@title Counting the parameters
total=0
for p,tensor in enumerate(LP):     # one entry per parameter tensor
  L1=tensor.shape[0]               # size of the first dimension
  if tensor.dim()==2:              # 2D tensor: weight matrix
    L2=tensor.shape[1]
    L3=L1*L2
    print(p,L1,L2,L3)              # index, rows, columns, parameter count
  else:                            # 1D tensor: bias or LayerNorm vector
    L3=L1
    print(p,L1,L3)                 # index, length, parameter count
  total+=L3                        # add the parameters of this tensor

print(total)                       # total number of parameters

0 52000 768 39936000
1 514 768 394752
2 1 768 768
3 768 768
4 768 768
5 768 768 589824
6 768 768
7 768 768 589824
8 768 768
9 768 768 589824
10 768 768
11 768 768 589824
12 768 768
13 768 768
14 768 768
15 3072 768 2359296
16 3072 3072
17 768 3072 2359296
18 768 768
19 768 768
20 768 768
21 768 768 589824
22 768 768
23 768 768 589824
24 768 768
25 768 768 589824
26 768 768
27 768 768 589824
28 768 768
29 768 768
30 768 768
31 3072 768 2359296
32 3072 3072
33 768 3072 2359296
34 768 768
35 768 768
36 768 768
37 768 768 589824
38 768 768
39 768 768 589824
40 768 768
41 768 768 589824
42 768 768
43 768 768 589824
44 768 768
45 768 768
46 768 768
47 3072 768 2359296
48 3072 3072
49 768 3072 2359296
50 768 768
51 768 768
52 768 768
53 768 768 589824
54 768 768
55 768 768 589824
56 768 768
57 768 768 589824
58 768 768
59 768 768 589824
60 768 768
61 768 768
62 768 768
63 3072 768 2359296
64 3072 3072
65 768 3072 2359296
66 768 768
67 768 768
68 768 768
69 768 768 589824
70 768 768
71 768 768

In [20]:
#@title Step 10: Building the Dataset
%%time
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./kant.txt",
    block_size=128,
)



CPU times: user 19.3 s, sys: 236 ms, total: 19.6 s
Wall time: 19.5 s
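Conceptually, a line-by-line dataset turns every non-blank line of the text file into one training example and truncates it to `block_size` tokens. The sketch below illustrates that idea with a hypothetical helper, `line_by_line`, and a stand-in whitespace encoder. It is not the library implementation, which uses the trained tokenizer and also adds special tokens.

```python
def line_by_line(text, encode, block_size=128):
    """Return one token-id list per non-blank line, truncated to block_size."""
    examples = []
    for line in text.splitlines():
        if not line.strip():
            continue                      # blank lines produce no example
        examples.append(encode(line)[:block_size])
    return examples

# Hypothetical stand-in encoder: one integer id per whitespace-separated word.
toy_vocab = {}
def toy_encode(line):
    return [toy_vocab.setdefault(word, len(toy_vocab)) for word in line.split()]

sample = "Pure reason has limits.\n\nReason seeks the unconditioned."
examples = line_by_line(sample, toy_encode, block_size=3)
print(examples)  # [[0, 1, 2], [4, 5, 6]]
```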


In [21]:
#@title Step 11: Defining a Data Collator
from transformers import DataCollatorForLanguageModeling
# mlm=True enables masked language modeling; mlm_probability is the share of tokens selected for masking
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
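The masking performed by the collator can be pictured with a small pure-Python sketch. It follows the commonly used BERT-style recipe, assumed here to be the collator's default: each non-special token is selected with probability `mlm_probability`; of the selected tokens, 80% become `<mask>`, 10% become a random token, and 10% stay unchanged. Positions that were not selected get the label `-100`, which the loss ignores. The helper `mask_tokens` and the toy ids are hypothetical.

```python
import random

def mask_tokens(ids, mask_id, vocab_size, special_ids, mlm_probability=0.15, seed=0):
    """Return (inputs, labels) following the 80/10/10 masking recipe."""
    rng = random.Random(seed)
    inputs, labels = list(ids), [-100] * len(ids)   # -100 = ignored by the loss
    for i, token in enumerate(ids):
        if token in special_ids or rng.random() >= mlm_probability:
            continue                      # position is not selected
        labels[i] = token                 # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id           # 80%: replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return inputs, labels

ids = [0] + list(range(100, 140)) + [2]   # <s> ... </s> around toy token ids
inputs, labels = mask_tokens(ids, mask_id=4, vocab_size=52_000, special_ids={0, 1, 2})
print(inputs)
print(labels)
```

The ids 0, 1, 2 and 4 correspond to `<s>`, `<pad>`, `</s>` and `<mask>` in the order the special tokens were passed to the tokenizer trainer in Step 3.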

In [22]:
import transformers
import accelerate

print("transformers version:", transformers.__version__)
print("accelerate version:", accelerate.__version__)

transformers version: 4.40.1
accelerate version: 0.29.3


In [None]:
!pip uninstall -y transformers accelerate
!pip install transformers[torch]

In [23]:
#@title Step 12: Initializing the Trainer
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./KantaiBERT",
    overwrite_output_dir=True,
    num_train_epochs=1, # one pass over the whole training set
    per_device_train_batch_size=64, # each device processes 64 samples per training step
    save_steps=10_000, # save a checkpoint every 10,000 steps (batches)
    save_total_limit=2, # keep only the two most recent checkpoints
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

In [24]:
#@title Step 13: Pre-training the Model
%%time
trainer.train()

Step,Training Loss
500,6.5998
1000,5.7253
1500,5.2464
2000,5.0008
2500,4.8913


CPU times: user 5min 12s, sys: 1.05 s, total: 5min 13s
Wall time: 5min 12s


TrainOutput(global_step=2672, training_loss=5.449815578803331, metrics={'train_runtime': 311.8825, 'train_samples_per_second': 548.168, 'train_steps_per_second': 8.567, 'total_flos': 873691623267840.0, 'train_loss': 5.449815578803331, 'epoch': 1.0})
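The reported step count can be sanity-checked with a little arithmetic. The sketch below estimates the number of training examples from the reported throughput and runtime, then derives the number of optimizer steps for one epoch. It assumes a single GPU and no gradient accumulation, and the sample count is an estimate rather than a value printed by the notebook.

```python
import math

train_runtime = 311.8825          # seconds, from TrainOutput
samples_per_second = 548.168      # from TrainOutput
batch_size = 64                   # per_device_train_batch_size, single GPU assumed

# Estimated number of training examples (throughput x runtime).
num_samples = round(samples_per_second * train_runtime)
steps_per_epoch = math.ceil(num_samples / batch_size)
steps_per_second = round(steps_per_epoch / train_runtime, 3)

print(num_samples, steps_per_epoch, steps_per_second)  # 170964 2672 8.567
```

Both derived values agree with `global_step=2672` and `train_steps_per_second=8.567` in the output above. With `save_steps=10_000`, no intermediate checkpoint is written during this run, because training ends after 2,672 steps.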

In [25]:
#@title Step 14: Saving the Final Model(+tokenizer + config) to disk
trainer.save_model("./KantaiBERT")

*   Saving model checkpoint to ./KantaiBERT
*   Configuration saved in ./KantaiBERT/config.json
*   Model weights saved in ./KantaiBERT/pytorch_model.bin

In [26]:
#@title Step 15: Language Modeling with the FillMaskPipeline
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./KantaiBERT",
    tokenizer="./KantaiBERT"
)

In [27]:
fill_mask("Human thinking involves human <mask>.")

[{'score': 0.039070095866918564,
  'token': 394,
  'token_str': ' reason',
  'sequence': 'Human thinking involves human reason.'},
 {'score': 0.014763986691832542,
  'token': 584,
  'token_str': ' intuition',
  'sequence': 'Human thinking involves human intuition.'},
 {'score': 0.014117002487182617,
  'token': 448,
  'token_str': ' law',
  'sequence': 'Human thinking involves human law.'},
 {'score': 0.011584443971514702,
  'token': 535,
  'token_str': ' experience',
  'sequence': 'Human thinking involves human experience.'},
 {'score': 0.010539493523538113,
  'token': 611,
  'token_str': ' conceptions',
  'sequence': 'Human thinking involves human conceptions.'}]
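Each `score` is the probability the model assigns to a candidate token at the masked position, so the scores over the whole vocabulary sum to 1. A short sketch using the values copied from the output above shows how much probability mass the five returned candidates hold:

```python
# Tokens and scores copied from the fill-mask output above.
predictions = [
    (" reason", 0.039070095866918564),
    (" intuition", 0.014763986691832542),
    (" law", 0.014117002487182617),
    (" experience", 0.011584443971514702),
    (" conceptions", 0.010539493523538113),
]

top5_mass = sum(score for _, score in predictions)
best_token, best_score = max(predictions, key=lambda item: item[1])

print(f"best: {best_token.strip()!r} ({best_score:.1%})")
print(f"probability mass of the top 5: {top5_mass:.1%}")
```

The top five candidates together account for only about 9% of the probability mass, which suggests that after a single epoch the model is still quite uncertain, even though its preferred completions are recognizably Kantian.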

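At this point, a transformer model has been trained from scratch:
* You can create a dataset for a specific task and train a model from scratch.
* Once you have built a model you like, you can share it with the Hugging Face community; it will be listed on the [Hugging Face models page](https://huggingface.co/models).

For details, see the documentation on [sharing a model on Hugging Face](https://huggingface.co/docs/transformers/model_sharing).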