# How to train a new language model from scratch using Transformers and Tokenizers

Copyright 2020, Denis Rothman. Denis Rothman adapted a Hugging Face reference notebook to pretrain a transformer model. The next steps would be to build a larger dataset and test several transformer models.

The Transformer model in this notebook is named ***KantaiBERT***. ***KantaiBERT*** is trained as a RoBERTa Transformer with a DistilBERT architecture. The dataset was compiled from three books by Immanuel Kant downloaded from [Project Gutenberg](https://www.gutenberg.org/).

<img src="https://eco-ai-horizons.com/data/Kant.jpg" style="margin: auto; display: block; width: 260px;">

![](https://commons.wikimedia.org/wiki/Kant_gemaelde_1.jpg)

***KantaiBERT*** was pretrained as a small model of 84 million parameters using the same number of layers and heads as DistilBERT, i.e., 6 layers, a hidden size of 768, and 12 attention heads. ***KantaiBERT*** is then fine-tuned for a downstream masked language modeling task.

### The original Hugging Face reference and notes

Notebook edition (the original reference blog post is available at this [link](https://huggingface.co/blog/how-to-train)).


In [1]:
#@title Step 1: Loading the Dataset
# Option 1: load kant.txt using the Colab file manager
# Option 2: download the file from GitHub
!curl -L https://raw.githubusercontent.com/PacktPublishing/Transformers-for-Natural-Language-Processing/master/Chapter03/kant.txt --output "kant.txt"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
 25 10.7M   25 2850k    0     0  6464k      0  0:00:01 --:--:--  0:00:01 6450k
 96 10.7M   96 10.3M    0     0  7365k      0  0:00:01  0:00:01 --:--:-- 7360k
100 10.7M  100 10.7M    0     0  7418k      0  0:00:01  0:00:01 --:--:-- 7418k


In [2]:
#@title Step 2: Installing Hugging Face Transformers
# We won't need TensorFlow here
#!pip uninstall -y tensorflow
# Install `transformers` from master
#!pip install git+https://github.com/huggingface/transformers
#!pip list | grep -E 'transformers|tokenizers'
# transformers version at notebook update --- 2.9.1
# tokenizers version at notebook update --- 0.7.0

In [3]:
#@title Step 3: Training a Tokenizer
#%%time 
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

paths = [str(x) for x in Path(".").glob("**/*.txt")]
# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])
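
To make the role of `vocab_size` and `min_frequency` more concrete, here is a minimal pure-Python sketch of the merge loop at the heart of BPE training. It is an illustration only: `train_bpe` is a hypothetical helper, it works on characters rather than bytes, and it is not how `ByteLevelBPETokenizer` is implemented internally.

```python
from collections import Counter

def train_bpe(words, num_merges, min_frequency=2):
    """Toy BPE trainer (hypothetical helper): learn up to `num_merges` merges."""
    # Every word starts as a tuple of single characters
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < min_frequency:
            break                      # pairs that are too rare are never merged
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol
        merged_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_corpus[tuple(out)] += freq
        corpus = merged_corpus
    return merges

print(train_bpe(["reason", "reason", "pure"], num_merges=1))  # [('r', 'e')]
```

In this toy corpus the pair `('r', 'e')` occurs three times, more than any other pair, so it is merged first. Each merge adds one entry to the vocabulary, which is why a larger `vocab_size` allows more merges.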

In [4]:
!cd

c:\Users\trodriguez\src\ml\Transformers-for-Natural-Language-Processing\Chapter03


In [5]:
#@title Step 4: Saving the files to disk
import os
token_dir = './KantaiBERT'
if not os.path.exists(token_dir):
  os.makedirs(token_dir)
#tokenizer.save_model('KantaiBERT')
tokenizer.save_model(token_dir)

['./KantaiBERT\\vocab.json', './KantaiBERT\\merges.txt']

In [6]:
#@title Step 5: Loading the Trained Tokenizer Files
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

tokenizer = ByteLevelBPETokenizer(
    "./KantaiBERT/vocab.json",
    "./KantaiBERT/merges.txt",
)

In [7]:
tokenizer.encode("The Critique of Pure Reason.").tokens

['The', 'ĠCritique', 'Ġof', 'ĠPure', 'ĠReason', '.']
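
The `Ġ` prefix in the tokens above stands for a leading space. Byte-level BPE maps every byte to a printable character before merging, and the space byte ends up as `Ġ`. The sketch below rebuilds that mapping in pure Python, assuming the tokenizer follows the byte-to-character table published with GPT-2; `bytes_to_unicode` here is a local re-implementation for illustration, not a call into the `tokenizers` library.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable character (GPT-2 style)."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {}
    shift = 0
    for byte in range(256):
        if byte in printable:
            table[byte] = chr(byte)          # printable bytes map to themselves
        else:
            table[byte] = chr(256 + shift)   # the others are moved above U+00FF
            shift += 1
    return table

table = bytes_to_unicode()
visible = "".join(table[b] for b in " Critique".encode("utf-8"))
print(visible)  # ĠCritique
```

The space is byte 32, the 33rd byte without a printable form, so it is shifted to code point 288, which is `Ġ`.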

In [8]:
tokenizer.encode("The Critique of Pure Reason.")

Encoding(num_tokens=6, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [9]:
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)
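
The post-processor and the truncation setting together decide the final shape of each encoded sequence. The sketch below imitates that behavior on plain lists. It assumes that `<s>` has id 0 and `</s>` has id 2 (their positions in the `special_tokens` list of Step 3) and that the 512-token limit includes the two special tokens; `wrap_and_truncate` is a hypothetical helper, not part of the `tokenizers` API.

```python
def wrap_and_truncate(token_ids, bos_id=0, eos_id=2, max_length=512):
    """Add <s> ... </s> around the ids and keep the result within max_length."""
    room = max_length - 2                 # two slots are reserved for <s> and </s>
    return [bos_id] + list(token_ids[:room]) + [eos_id]

# Placeholder ids standing in for a long tokenized line
long_line = list(range(100, 700))         # 600 ids
encoded = wrap_and_truncate(long_line)
print(len(encoded), encoded[0], encoded[-1])  # 512 0 2
```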

In [10]:
#@title Step 6: Checking Resource Constraints: GPU and NVIDIA 
!nvidia-smi

Thu Sep  2 23:46:58 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 471.96       Driver Version: 471.96       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ... WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   62C    P8    N/A /  N/A |     64MiB /  2048MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [11]:
#@title Checking that PyTorch Sees CUDA
import torch
torch.cuda.is_available()

True

In [12]:
#@title Step 7: Defining the configuration of the Model
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

In [13]:
print(config)

RobertaConfig {
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.3.2",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}
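
The configuration sets `max_position_embeddings` to 514 although sequences are truncated to 512 tokens. A likely reason is that RoBERTa numbers positions starting just after `pad_token_id`, so with `pad_token_id` equal to 1 the first real token gets position 2. The sketch below reproduces that numbering on plain lists; `roberta_position_ids` is a hypothetical helper written for illustration, not the library function.

```python
def roberta_position_ids(input_ids, padding_idx=1):
    """Number non-padding tokens from padding_idx + 1; padding keeps padding_idx."""
    position_ids = []
    seen = 0
    for token_id in input_ids:
        if token_id == padding_idx:
            position_ids.append(padding_idx)
        else:
            seen += 1
            position_ids.append(padding_idx + seen)
    return position_ids

print(roberta_position_ids([0, 5, 2, 1, 1]))  # [2, 3, 4, 1, 1]

# A full 512-token sequence (placeholder id 5 repeated) reaches position 513,
# so the position embedding table needs 514 rows (indices 0 to 513)
full_sequence = [5] * 512
print(max(roberta_position_ids(full_sequence)))  # 513
```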



In [14]:
#@title Step 8: Re-creating the Tokenizer in Transformers
from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained("./KantaiBERT", max_length=512)

In [15]:
#@title Step 9: Initializing a Model From Scratch
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config)
print(model)

RobertaForMaskedLM(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(52000, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNor

In [16]:
print(model.num_parameters())

83504416
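
The total of 83,504,416 can be reproduced by hand from the configuration values. The arithmetic below is a sketch that assumes the output projection of the language modeling head shares its weight matrix with the word embeddings, so only its bias is counted; under that assumption the sum matches the number reported by `model.num_parameters()`.

```python
# Values taken from the configuration printed in Step 7
vocab, hidden, layers, ffn, max_pos, type_vocab = 52_000, 768, 6, 3_072, 514, 1

def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden                     # one weight and one bias per feature

embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
attention = 4 * linear(hidden, hidden) + layer_norm          # query, key, value, output
feed_forward = linear(hidden, ffn) + linear(ffn, hidden) + layer_norm
encoder = layers * (attention + feed_forward)
# Assumption: the decoder weight is tied to the word embeddings, only its bias is new
lm_head = linear(hidden, hidden) + layer_norm + vocab

total = embeddings + encoder + lm_head
print(embeddings, encoder, lm_head)  # 40333056 42527232 644128
print(total)                         # 83504416
```

Almost half of the parameters sit in the word embedding matrix (52,000 x 768), which is why the vocabulary size has such a large effect on the size of a small model.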


In [17]:
#@title Exploring the Parameters
LP=list(model.parameters())
lp=len(LP)
print(lp)
for p in range(0,lp):
  print(LP[p])

106
Parameter containing:
tensor([[ 0.0031,  0.0005, -0.0037,  ...,  0.0045,  0.0037,  0.0157],
        [ 0.0129,  0.0081, -0.0406,  ...,  0.0076,  0.0004, -0.0029],
        [ 0.0291,  0.0123, -0.0061,  ...,  0.0506,  0.0073, -0.0318],
        ...,
        [-0.0009, -0.0263,  0.0140,  ..., -0.0077,  0.0019, -0.0143],
        [-0.0018, -0.0172, -0.0062,  ...,  0.0011,  0.0419,  0.0297],
        [ 0.0126,  0.0355, -0.0138,  ..., -0.0206,  0.0038, -0.0271]],
       requires_grad=True)
Parameter containing:
tensor([[-0.0264,  0.0140, -0.0299,  ...,  0.0010,  0.0018, -0.0078],
        [ 0.0082, -0.0277, -0.0529,  ..., -0.0199, -0.0023, -0.0080],
        [-0.0090,  0.0121, -0.0231,  ...,  0.0238,  0.0145,  0.0244],
        ...,
        [-0.0084, -0.0081,  0.0147,  ...,  0.0256, -0.0258, -0.0163],
        [-0.0291, -0.0022, -0.0143,  ..., -0.0231, -0.0187,  0.0175],
        [-0.0065, -0.0243,  0.0071,  ...,  0.0079, -0.0020, -0.0382]],
       requires_grad=True)
Parameter containing:
tensor([

In [18]:
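#@title Counting the parameters
total_params = 0                 # running total (avoids shadowing the usual numpy alias `np`)
for p in range(0, lp):           # loop over the parameter tensors
  is_2d = True
  try:
    L2 = len(LP[p][0])           # size of the second dimension if the tensor is 2D
  except TypeError:
    L2 = 1                       # 1D tensor: its elements are 0-d and have no len()
    is_2d = False
  L1 = len(LP[p])                # size of the first dimension
  L3 = L1 * L2                   # number of parameters in this tensor
  total_params += L3
  if is_2d:
    print(p, L1, L2, L3)         # 2D tensor: index, rows, columns, parameter count
  else:
    print(p, L1, L3)             # 1D tensor: index, length, parameter count

print(total_params)              # total number of parameters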

0 52000 768 39936000
1 514 768 394752
2 1 768 768
3 768 768
4 768 768
5 768 768 589824
6 768 768
7 768 768 589824
8 768 768
9 768 768 589824
10 768 768
11 768 768 589824
12 768 768
13 768 768
14 768 768
15 3072 768 2359296
16 3072 3072
17 768 3072 2359296
18 768 768
19 768 768
20 768 768
21 768 768 589824
22 768 768
23 768 768 589824
24 768 768
25 768 768 589824
26 768 768
27 768 768 589824
28 768 768
29 768 768
30 768 768
31 3072 768 2359296
32 3072 3072
33 768 3072 2359296
34 768 768
35 768 768
36 768 768
37 768 768 589824
38 768 768
39 768 768 589824
40 768 768
41 768 768 589824
42 768 768
43 768 768 589824
44 768 768
45 768 768
46 768 768
47 3072 768 2359296
48 3072 3072
49 768 3072 2359296
50 768 768
51 768 768
52 768 768
53 768 768 589824
54 768 768
55 768 768 589824
56 768 768
57 768 768 589824
58 768 768
59 768 768 589824
60 768 768
61 768 768
62 768 768
63 3072 768 2359296
64 3072 3072
65 768 3072 2359296
66 768 768
67 768 768
68 768 768
69 768 768 589824
70 768 768
71 768 768

In [19]:
#@title Step 10: Building the Dataset
#%%time
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./kant.txt",
    block_size=128,
)
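
`LineByLineTextDataset` treats every non-blank line of `kant.txt` as one training example and truncates it to `block_size` tokens. The sketch below mimics that idea on plain Python lists. Both `line_by_line_examples` and `toy_encode` are hypothetical helpers; `toy_encode` only stands in for the trained tokenizer and does not produce real token ids.

```python
def line_by_line_examples(lines, encode, block_size=128):
    """Turn each non-blank line into one list of ids, truncated to block_size."""
    examples = []
    for line in lines:
        if len(line) == 0 or line.isspace():
            continue                          # blank lines produce no example
        examples.append(encode(line)[:block_size])
    return examples

def toy_encode(text):
    # Hypothetical stand-in for the tokenizer: one fake id per whitespace-separated word
    return [len(word) for word in text.split()]

lines = ["The Critique of Pure Reason.", "", "   ", "word " * 200]
examples = line_by_line_examples(lines, toy_encode)
print(len(examples), [len(e) for e in examples])  # 2 [5, 128]
```

Because one line becomes one example, the number of training examples is driven by the number of lines in the file rather than by its size in bytes.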



In [20]:
#@title Step 11: Defining a Data Collator
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
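
With `mlm=True` and `mlm_probability=0.15`, the collator hides about 15% of the tokens in each batch and asks the model to predict them. The sketch below shows the usual BERT-style recipe on a plain list: of the selected tokens, 80% become `<mask>`, 10% become a random token, and 10% are left unchanged, while every other position gets the label -100 so the loss ignores it. `mask_tokens` is a hypothetical helper, and the ids 0, 1, 2, and 4 for `<s>`, `<pad>`, `</s>`, and `<mask>` are assumed from the order of the `special_tokens` list in Step 3.

```python
import random

def mask_tokens(input_ids, mask_id=4, vocab_size=52_000, special_ids=(0, 1, 2),
                mlm_probability=0.15, seed=0):
    """Return (masked_ids, labels) following the 80/10/10 masking recipe."""
    rng = random.Random(seed)                 # seeded so the sketch is repeatable
    masked = list(input_ids)
    labels = [-100] * len(input_ids)          # -100 means: not part of the loss
    for i, token_id in enumerate(input_ids):
        if token_id in special_ids or rng.random() >= mlm_probability:
            continue                          # special and unselected tokens are untouched
        labels[i] = token_id                  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id               # 80%: replace with <mask>
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels

# Placeholder ids standing in for one tokenized line
ids = [0] + list(range(100, 200)) + [2]
masked, labels = mask_tokens(ids)
selected = sum(1 for label in labels if label != -100)
print(selected, "of", len(ids), "positions selected for prediction")
```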

In [21]:
#@title Step 12: Initializing the Trainer
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./KantaiBERT",
    overwrite_output_dir=True,
    num_train_epochs=1,
    #per_device_train_batch_size=64,
    per_device_train_batch_size=8,
    save_steps=10_000,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

W&B installed but not logged in. Run `wandb login` or set the WANDB_API_KEY env variable.
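
The batch size determines how many optimization steps one epoch takes. The progress bar of the training cell (Step 13) reports 21,371 steps for `per_device_train_batch_size=8`, which bounds the number of examples in the dataset. The arithmetic below is a sketch that assumes a single device, no gradient accumulation, and a final partial batch that is kept; `steps_per_epoch` is a hypothetical helper.

```python
import math

def steps_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every example once."""
    return math.ceil(num_examples / batch_size)

total_steps, batch_size = 21_371, 8

# Smallest and largest dataset sizes that give exactly 21,371 steps
low = (total_steps - 1) * batch_size + 1
high = total_steps * batch_size
print(low, high)                    # 170961 170968

# The commented-out batch size of 64 would need far fewer steps
print(steps_per_epoch(low, 64), steps_per_epoch(high, 64))  # 2672 2672
```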


In [22]:
#@title Step 13: Pre-training the Model
#%%time
trainer.train()

  0%|          | 1/21371 [00:00<2:55:13,  2.03it/s]

RuntimeError: CUDA out of memory. Tried to allocate 28.00 MiB (GPU 0; 2.00 GiB total capacity; 1.35 GiB already allocated; 0 bytes free; 1.39 GiB reserved in total by PyTorch)
Exception raised from malloc at ..\c10\cuda\CUDACachingAllocator.cpp:272 (most recent call first):
00007FFBB2E275A200007FFBB2E27540 c10.dll!c10::Error::Error [<unknown file> @ <unknown line number>]
00007FFBB2DC9C0600007FFBB2DC9B90 c10_cuda.dll!c10::CUDAOutOfMemoryError::CUDAOutOfMemoryError [<unknown file> @ <unknown line number>]
00007FFBB2DD069600007FFBB2DCF370 c10_cuda.dll!c10::cuda::CUDACachingAllocator::init [<unknown file> @ <unknown line number>]
00007FFBB2DD083A00007FFBB2DCF370 c10_cuda.dll!c10::cuda::CUDACachingAllocator::init [<unknown file> @ <unknown line number>]
00007FFBB2DC509900007FFBB2DC4EB0 c10_cuda.dll!c10::cuda::CUDAStream::unpack [<unknown file> @ <unknown line number>]
00007FFB19E17D3100007FFB19E17C10 torch_cuda.dll!THCStorage_resizeBytes [<unknown file> @ <unknown line number>]
00007FFB194E46B700007FFB194E26E0 torch_cuda.dll!at::native::replication_pad3d_out_cuda [<unknown file> @ <unknown line number>]
00007FFB19E22F7400007FFB19E22E90 torch_cuda.dll!THCTensor_resizeAs [<unknown file> @ <unknown line number>]
00007FFB18B9377000007FFB18B936C0 torch_cuda.dll!THNN_CudaClassNLLCriterion_updateGradInput [<unknown file> @ <unknown line number>]
00007FFB19E094E400007FFB19D7E0A0 torch_cuda.dll!at::native::set_storage_cuda_ [<unknown file> @ <unknown line number>]
00007FFB19DE3D2D00007FFB19D7E0A0 torch_cuda.dll!at::native::set_storage_cuda_ [<unknown file> @ <unknown line number>]
00007FFB19DD123F00007FFB19D7E0A0 torch_cuda.dll!at::native::set_storage_cuda_ [<unknown file> @ <unknown line number>]
00007FFB415DC41D00007FFB415D8FA0 torch_cpu.dll!at::bucketize_out [<unknown file> @ <unknown line number>]
00007FFB4161596800007FFB416158D0 torch_cpu.dll!at::nll_loss_backward [<unknown file> @ <unknown line number>]
00007FFB428E24F400007FFB4287E010 torch_cpu.dll!torch::autograd::GraphRoot::apply [<unknown file> @ <unknown line number>]
00007FFB4155256F00007FFB414FD9D0 torch_cpu.dll!at::native::mkldnn_sigmoid_ [<unknown file> @ <unknown line number>]
00007FFB415DC41D00007FFB415D8FA0 torch_cpu.dll!at::bucketize_out [<unknown file> @ <unknown line number>]
00007FFB4161596800007FFB416158D0 torch_cpu.dll!at::nll_loss_backward [<unknown file> @ <unknown line number>]
00007FFB427E574A00007FFB427E5580 torch_cpu.dll!torch::autograd::generated::NllLossBackward::apply [<unknown file> @ <unknown line number>]
00007FFB427B7E9100007FFB427B7B50 torch_cpu.dll!torch::autograd::Node::operator() [<unknown file> @ <unknown line number>]
00007FFB42D1F9BA00007FFB42D1F300 torch_cpu.dll!torch::autograd::Engine::add_thread_pool_task [<unknown file> @ <unknown line number>]
00007FFB42D203AD00007FFB42D1FFD0 torch_cpu.dll!torch::autograd::Engine::evaluate_function [<unknown file> @ <unknown line number>]
00007FFB42D24FE200007FFB42D24CA0 torch_cpu.dll!torch::autograd::Engine::thread_main [<unknown file> @ <unknown line number>]
00007FFB42D24C4100007FFB42D24BC0 torch_cpu.dll!torch::autograd::Engine::thread_init [<unknown file> @ <unknown line number>]
00007FFB0BCF090700007FFB0BCC9FE0 torch_python.dll!THPShortStorage_New [<unknown file> @ <unknown line number>]
00007FFB42D1BF1400007FFB42D1B780 torch_cpu.dll!torch::autograd::Engine::get_base_engine [<unknown file> @ <unknown line number>]
00007FFBE4421BB200007FFBE4421B20 ucrtbase.dll!configthreadlocale [<unknown file> @ <unknown line number>]
00007FFBE477703400007FFBE4777020 KERNEL32.DLL!BaseThreadInitThunk [<unknown file> @ <unknown line number>]
00007FFBE652265100007FFBE6522630 ntdll.dll!RtlUserThreadStart [<unknown file> @ <unknown line number>]
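
The error above is consistent with a back-of-the-envelope memory estimate. The sketch below assumes 32-bit floats and an Adam-style optimizer that keeps two extra values per parameter, and it ignores activations, the CUDA context, and allocator overhead, so it is a lower bound rather than a measurement.

```python
num_parameters = 83_504_416     # reported by model.num_parameters()
bytes_per_value = 4             # assumption: 32-bit floats
copies = 4                      # weights, gradients, and two Adam moment estimates

total_bytes = num_parameters * bytes_per_value * copies
gib = total_bytes / 2**30
print(f"{gib:.2f} GiB")         # 1.24 GiB
```

Under these assumptions the model and optimizer state alone take about 1.24 GiB, which leaves little room for activations on a 2 GiB GPU. Lowering `per_device_train_batch_size`, optionally combined with the `gradient_accumulation_steps` argument of `TrainingArguments`, or reducing `block_size` may help, although none of these was tested in this notebook.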


In [None]:
#@title Step 14: Saving the Final Model (+ tokenizer + config) to disk
trainer.save_model("./KantaiBERT")

In [None]:
#@title Step 15: Language Modeling with the FillMaskPipeline
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./KantaiBERT",
    tokenizer="./KantaiBERT"
)

Some weights of RobertaModel were not initialized from the model checkpoint at ./KantaiBERT and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
fill_mask("Human thinking involves<mask>.")

[{'score': 0.010303723625838757,
  'sequence': '<s>Human thinking involves reason.</s>',
  'token': 394,
  'token_str': 'Ġreason'},
 {'score': 0.010289391502737999,
  'sequence': '<s>Human thinking involves priori.</s>',
  'token': 578,
  'token_str': 'Ġpriori'},
 {'score': 0.009549057111144066,
  'sequence': '<s>Human thinking involves conceptions.</s>',
  'token': 610,
  'token_str': 'Ġconceptions'},
 {'score': 0.008349979296326637,
  'sequence': '<s>Human thinking involves experience.</s>',
  'token': 535,
  'token_str': 'Ġexperience'},
 {'score': 0.00743826711550355,
  'sequence': '<s>Human thinking involves will.</s>',
  'token': 487,
  'token_str': 'Ġwill'}]
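
Each `score` above is a probability over the whole vocabulary, and the pipeline returns the five most probable tokens. The sketch below shows the two operations involved, a softmax followed by a top-k selection, on a tiny list of made-up logits; `softmax` and `top_k` are hypothetical helpers, not the pipeline's own code.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k=5):
    """Return the k (index, probability) pairs with the highest probability."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

probs = softmax([2.0, 1.0, 0.1])                    # made-up logits for three tokens
print(top_k(probs, k=2))

# Compare the best score reported above with a uniform guess over 52,000 tokens
print(round(0.010303723625838757 * 52_000))         # 536
```

A top score of about 0.01 is low in absolute terms, yet it is roughly 536 times the probability of a uniform guess over the 52,000-token vocabulary.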