<a href="https://colab.research.google.com/github/skayasare/Transformers-for-NLP/blob/main/Train_KantaiBERT_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#How to train a new language model from scratch using Transformers and Tokenizers


The Transformer model in this notebook is named *KantaiBERT*. *KantaiBERT* is trained as a RoBERTa-type Transformer with the same number of layers and heads as DistilBERT.

#Step 1: Loading the dataset

In [1]:
from IPython.display import Image     #This is used for rendering images in the notebook

In [2]:
import requests
from PIL import Image
from io import BytesIO

def get_image_from_github(image_name):
    # The base URL of the image files in the GitHub repository
    base_url = 'https://raw.githubusercontent.com/Denis2054/Transformers-for-NLP-2nd-Edition/main/Notebook%20images/04/'

    # Make the request
    response = requests.get(base_url + image_name)

    # Check if the request was successful
    if response.status_code == 200:
        # Read the image
        image = Image.open(BytesIO(response.content))

        # Return the image
        return image
    else:
        print(f'Error {response.status_code}: Could not access the image file.')
        return None

In [3]:
#1.Load kant.txt using the Colab file manager
# or
#2.Download the file from GitHub
!curl -L https://raw.githubusercontent.com/Denis2054/Transformers-for-NLP-2nd-Edition/master/Chapter04/kant.txt --output "kant.txt"


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 10.7M  100 10.7M    0     0  12.2M      0 --:--:-- --:--:-- --:--:-- 12.2M


#Step 2: Installing Hugging Face transformers

In [4]:
!pip install transformers
!pip install --upgrade accelerate
from accelerate import Accelerator

Collecting accelerate
  Downloading accelerate-0.26.1-py3-none-any.whl (270 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m270.9/270.9 kB[0m [31m3.3 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: accelerate
Successfully installed accelerate-0.26.1


#Step 3: Training a tokenizer

Hugging Face’s ByteLevelBPETokenizer() will be trained using kant.txt. A BPE tokenizer will
break a string or word down into substrings or subwords.
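
To make the idea of merging substrings concrete, here is a minimal, character-level sketch of how BPE merge rules can be learned. It is a toy illustration, not the Hugging Face implementation: the helper `learn_bpe_merges` and the four sample words are hypothetical, and the real tokenizer works on bytes with its own pre-tokenization.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each distinct word is a tuple of symbols, stored with its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair and merge it everywhere.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

# Hypothetical sample words, chosen only for illustration.
merges, corpus = learn_bpe_merges(["reason", "reason", "reasons", "season"], num_merges=3)
print(merges)        # learned merge rules, in order
print(list(corpus))  # words segmented into the learned subwords
```

Each merge rule joins two existing symbols into a longer one, so frequent character sequences end up as single subword tokens while rare words stay split into smaller pieces.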

In [5]:
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

paths = [str(x) for x in Path(".").glob("**/*.txt")]

# Read the content from the files, ignoring or replacing invalid characters
file_contents = []
for path in paths:
    try:
        with open(path, 'r', encoding='utf-8', errors='replace') as file:
            file_contents.append(file.read())
    except Exception as e:
        print(f"Error reading {path}: {e}")

# Join the contents into a single string
text = "\n".join(file_contents)

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train_from_iterator(
    [text],
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

#Step 4: Saving the files to disk

The tokenizer generates two files when it is saved:
*   merges.txt, which lists the merge rules learned during training (the pairs of substrings that were merged)
*   vocab.json, which maps each tokenized substring to its integer index



In [6]:
import os

token_dir = 'KantaiBERT'                # relative to the working directory (/content in Colab)
os.makedirs(token_dir, exist_ok=True)   # create the directory if it does not exist yet
tokenizer.save_model(token_dir)

['KantaiBERT/vocab.json', 'KantaiBERT/merges.txt']

#Step 5: Loading the trained tokenizer files

In [7]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

tokenizer = ByteLevelBPETokenizer(
    "./KantaiBERT/vocab.json",
    "./KantaiBERT/merges.txt",
)

In [8]:
tokenizer.encode("The Critique of Pure Reason.").tokens

['The', 'ĠCritique', 'Ġof', 'ĠPure', 'ĠReason', '.']
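
The `Ġ` prefix in the tokens above marks a leading space. Byte-level BPE maps every byte to a printable character before merging, and the space byte ends up as `Ġ`. The sketch below rebuilds a byte-to-character table in the style of GPT-2's mapping, which the byte-level tokenizer is assumed to follow; the function here is an illustrative re-implementation, not the library's code.

```python
def bytes_to_unicode():
    """Byte -> printable character table in the style of GPT-2 byte-level BPE."""
    # Bytes that are already printable, non-whitespace characters keep their code point.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {}
    shift = 0
    for b in range(256):
        if b in keep:
            table[b] = chr(b)
        else:
            # Remaining bytes (space, control characters, ...) are moved past code point 255.
            table[b] = chr(256 + shift)
            shift += 1
    return table

table = bytes_to_unicode()
print(table[ord(" ")])                                            # Ġ
print("".join(table[b] for b in " Critique".encode("utf-8")))     # ĠCritique
```

This is why `'ĠCritique'` stands for the word "Critique" preceded by a space, while `'The'` at the start of the sentence has no prefix.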

In [9]:

tokenizer.encode("The Critique of Pure Reason.")

Encoding(num_tokens=6, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

The tokenizer now processes the tokens to fit the BERT model variant used in this notebook. The
post-processor will add a start and end token; for example:

In [10]:
# BertProcessing takes the separator (end) token first, then the classifier (start) token
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

Let’s encode a post-processed sequence:

In [11]:
tokenizer.encode("The Critique of Pure Reason.")

Encoding(num_tokens=8, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

If we want to see what was added, we can ask the tokenizer to encode the post-processed sequence
by running the following cell:

In [12]:
tokenizer.encode("The Critique of Pure Reason.").tokens

['<s>', 'The', 'ĠCritique', 'Ġof', 'ĠPure', 'ĠReason', '.', '</s>']
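
The jump from 6 to 8 tokens can be mimicked in plain Python. This is only a sketch of the effect: `post_process` is a hypothetical helper, and it assumes that truncation reserves room for the two special tokens inside `max_length`.

```python
def post_process(tokens, bos="<s>", eos="</s>", max_length=512):
    """Toy stand-in for the post-processor plus truncation (hypothetical helper)."""
    # Reserve two positions for the special tokens, then truncate the body.
    body = tokens[: max_length - 2]
    return [bos] + body + [eos]

tokens = ["The", "ĠCritique", "Ġof", "ĠPure", "ĠReason", "."]
wrapped = post_process(tokens)
print(len(tokens), len(wrapped))  # 6 8
print(wrapped)
```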

#Step 6: Checking for GPU availability

In [13]:
import torch
torch.cuda.is_available()

True

#Step 7: Defining the configuration of the model

We will be pretraining a RoBERTa-type transformer model using the same number of layers
and heads as a DistilBERT transformer. The model will have a vocabulary size set to 52,000, 12
attention heads, and 6 layers:

In [14]:
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

In [15]:
print(config)

RobertaConfig {
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.35.2",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}
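
Two of the printed values are easier to read with a little arithmetic. The sketch below uses only numbers from the configuration above. The comment about position ids reflects how RoBERTa-style models offset positions past the padding index, which is stated here as background rather than taken from this notebook.

```python
hidden_size = 768
num_attention_heads = 12
max_position_embeddings = 514
pad_token_id = 1

# The hidden vector is split evenly across the attention heads.
assert hidden_size % num_attention_heads == 0
head_dim = hidden_size // num_attention_heads

# RoBERTa-style position ids start at pad_token_id + 1,
# so the first two rows of the position table are not used for real tokens.
usable_positions = max_position_embeddings - (pad_token_id + 1)

print(head_dim)          # 64
print(usable_positions)  # 512
```

The 512 usable positions match the `max_length=512` truncation set on the tokenizer.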



#Step 8: Reloading the tokenizer in transformers

We are now ready to load our trained tokenizer as a pretrained tokenizer with
RobertaTokenizer.from_pretrained()

In [16]:
from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained("./KantaiBERT", max_length=512)

#Step 9: Initializing a model from scratch

First, import a RoBERTa model for masked language modeling.
The model is initialized with the configuration defined in Step 7.

In [17]:
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config)
print(model)

RobertaForMaskedLM(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(52000, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-5): 6 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): La

#Exploring the Parameters

In [18]:
print(model.num_parameters())

83504416


Let’s now look into the parameters. We first store the parameters in LP and calculate the length
of the list of parameters:

In [19]:
LP = list(model.parameters())
lp = len(LP)
print(lp)

106


Now, let’s display the 106 matrices and vectors in the tensors that contain them.
The number of parameters is calculated by adding up the sizes of all the tensors in the model; for example:


*   The word embedding matrix: vocabulary (52,000) x dimensions (768) = 39,936,000 parameters
*   Most bias and layer normalization vectors: 1 x 768 = 768 parameters each


In [20]:
for p in range(lp):
    print(LP[p])

Parameter containing:
tensor([[ 0.0176, -0.0101, -0.0004,  ...,  0.0411, -0.0250, -0.0376],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0027,  0.0172, -0.0417,  ..., -0.0184,  0.0303, -0.0007],
        ...,
        [ 0.0001, -0.0315, -0.0038,  ..., -0.0311,  0.0278, -0.0065],
        [ 0.0068,  0.0077,  0.0107,  ..., -0.0109, -0.0252,  0.0098],
        [ 0.0134, -0.0399,  0.0279,  ...,  0.0140,  0.0184,  0.0052]],
       requires_grad=True)
Parameter containing:
tensor([[-0.0325, -0.0143, -0.0030,  ...,  0.0065, -0.0257,  0.0365],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0216, -0.0182,  0.0006,  ..., -0.0023, -0.0082,  0.0029],
        ...,
        [-0.0124, -0.0145,  0.0361,  ...,  0.0059,  0.0474, -0.0244],
        [ 0.0036, -0.0105,  0.0168,  ..., -0.0113, -0.0099,  0.0300],
        [ 0.0364,  0.0257,  0.0064,  ..., -0.0026,  0.0002, -0.0188]],
       requires_grad=True)
Parameter containing:
tensor([[-8.

In [21]:
#Shape of each tensor in the model
LP = list(model.parameters())
for i, tensor in enumerate(LP):
    print(f"Shape of tensor {i}: {tensor.shape}")

Shape of tensor 0: torch.Size([52000, 768])
Shape of tensor 1: torch.Size([514, 768])
Shape of tensor 2: torch.Size([1, 768])
Shape of tensor 3: torch.Size([768])
Shape of tensor 4: torch.Size([768])
Shape of tensor 5: torch.Size([768, 768])
Shape of tensor 6: torch.Size([768])
Shape of tensor 7: torch.Size([768, 768])
Shape of tensor 8: torch.Size([768])
Shape of tensor 9: torch.Size([768, 768])
Shape of tensor 10: torch.Size([768])
Shape of tensor 11: torch.Size([768, 768])
Shape of tensor 12: torch.Size([768])
Shape of tensor 13: torch.Size([768])
Shape of tensor 14: torch.Size([768])
Shape of tensor 15: torch.Size([3072, 768])
Shape of tensor 16: torch.Size([3072])
Shape of tensor 17: torch.Size([768, 3072])
Shape of tensor 18: torch.Size([768])
Shape of tensor 19: torch.Size([768])
Shape of tensor 20: torch.Size([768])
Shape of tensor 21: torch.Size([768, 768])
Shape of tensor 22: torch.Size([768])
Shape of tensor 23: torch.Size([768, 768])
Shape of tensor 24: torch.Size([768])
Sh

In [22]:
# Counting the parameters tensor by tensor
total_params = 0
for p, tensor in enumerate(LP):        # iterate over all parameter tensors
    L1 = tensor.shape[0]               # size of the first dimension
    if tensor.dim() == 2:              # 2D weight matrix
        L2 = tensor.shape[1]
        L3 = L1 * L2                   # number of parameters in this tensor
        print(p, L1, L2, L3)
    else:                              # 1D bias or layer normalization vector
        L3 = L1
        print(p, L1, L3)
    total_params += L3                 # accumulate the running total

print(total_params)                    # total number of parameters


0 52000 768 39936000
1 514 768 394752
2 1 768 768
3 768 768
4 768 768
5 768 768 589824
6 768 768
7 768 768 589824
8 768 768
9 768 768 589824
10 768 768
11 768 768 589824
12 768 768
13 768 768
14 768 768
15 3072 768 2359296
16 3072 3072
17 768 3072 2359296
18 768 768
19 768 768
20 768 768
21 768 768 589824
22 768 768
23 768 768 589824
24 768 768
25 768 768 589824
26 768 768
27 768 768 589824
28 768 768
29 768 768
30 768 768
31 3072 768 2359296
32 3072 3072
33 768 3072 2359296
34 768 768
35 768 768
36 768 768
37 768 768 589824
38 768 768
39 768 768 589824
40 768 768
41 768 768 589824
42 768 768
43 768 768 589824
44 768 768
45 768 768
46 768 768
47 3072 768 2359296
48 3072 3072
49 768 3072 2359296
50 768 768
51 768 768
52 768 768
53 768 768 589824
54 768 768
55 768 768 589824
56 768 768
57 768 768 589824
58 768 768
59 768 768 589824
60 768 768
61 768 768
62 768 768
63 3072 768 2359296
64 3072 3072
65 768 3072 2359296
66 768 768
67 768 768
68 768 768
69 768 768 589824
70 768 768
71 768 768
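
The totals reported earlier (83,504,416 parameters in 106 tensors) can be cross-checked by hand from the configuration. The sketch below is plain arithmetic under one stated assumption: the output projection of the language modeling head shares its weight matrix with the word embeddings, so only its bias is counted. The helper `linear` is hypothetical.

```python
vocab_size, hidden, layers = 52_000, 768, 6
intermediate, max_positions, type_vocab = 3072, 514, 1

def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # weight and bias vectors

# Word, position, and token type embeddings, plus one layer normalization.
embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

# One encoder layer: query, key, value, attention output, then the feed-forward block.
encoder_layer = (
    4 * linear(hidden, hidden) + layer_norm
    + linear(hidden, intermediate)
    + linear(intermediate, hidden) + layer_norm
)

# LM head: dense layer, layer normalization, and the output bias.
# Assumption: the output weight matrix is tied to the word embeddings.
lm_head = linear(hidden, hidden) + layer_norm + vocab_size

total = embeddings + layers * encoder_layer + lm_head
tensors = 5 + layers * 16 + 5

print(total)    # 83504416
print(tensors)  # 106
```

Both values agree with `model.num_parameters()` and `len(LP)` above, and nearly half of the parameters sit in the 52,000 x 768 word embedding matrix.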

#Step 10: Building the dataset

In [23]:
%%time
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer = tokenizer,
    file_path = "./kant.txt",
    block_size = 128,
)



CPU times: user 30.7 s, sys: 775 ms, total: 31.4 s
Wall time: 40.5 s


#Step 11: Defining a data collator

We need to run a data collator before initializing the trainer. A data collator will take samples
from the dataset and collate them into batches. The results are dictionary-like objects.

We prepare batches for MLM by setting mlm=True.
We also set mlm_probability=0.15, which is the fraction of tokens selected for masking
in each batch during the pretraining process.

In [24]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
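
What the collator does to each sequence can be sketched in pure Python. This is a simplified illustration, not the library's code: `mask_tokens` is a hypothetical helper, the token ids are made up, and the 80/10/10 split (mask, random token, unchanged) follows the usual BERT recipe, which the collator is assumed to apply by default. The special token ids 0, 1, 2, and 4 match the order in which the special tokens were passed to the tokenizer trainer.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(), mlm_probability=0.15, seed=0):
    """Return (inputs, labels) for masked language modeling on one sequence."""
    rng = random.Random(seed)
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        # Special tokens are never selected; other tokens are selected with mlm_probability.
        if tok in special_ids or rng.random() >= mlm_probability:
            labels.append(-100)  # -100 marks positions ignored by the loss
            continue
        labels.append(tok)       # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels

# Made-up sequence: <s> (0), 100 arbitrary token ids, </s> (2)
ids = [0] + list(range(10, 110)) + [2]
inputs, labels = mask_tokens(ids, mask_id=4, vocab_size=52_000, special_ids={0, 1, 2})
selected = sum(label != -100 for label in labels)
print(f"{selected} of {len(ids)} tokens selected for prediction")
```

Because the selection is random, the masked positions change on every pass over the data when the real collator is used, so the model sees different masks in each epoch.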


#Step 12: Initializing the trainer

In [25]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./KantaiBERT",
    overwrite_output_dir=True,
    num_train_epochs=5, #can be increased
    per_device_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

#Step 13: Pretraining the model

In [28]:
%%time
trainer.train()

Step,Training Loss
500,4.9083
1000,4.6128
1500,4.3517
2000,4.1973
2500,4.1085
3000,3.9448
3500,3.871
4000,3.8087
4500,3.7105
5000,3.6563


CPU times: user 45min 55s, sys: 10.3 s, total: 46min 5s
Wall time: 46min 23s


TrainOutput(global_step=13360, training_loss=3.606951273843914, metrics={'train_runtime': 2782.8197, 'train_samples_per_second': 307.178, 'train_steps_per_second': 4.801, 'total_flos': 4601476742144256.0, 'train_loss': 3.606951273843914, 'epoch': 5.0})
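
The numbers in `TrainOutput` can be related to the training arguments with a little arithmetic. The sketch assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; none of these settings is shown explicitly in the notebook, so treat the derived dataset size as an estimate.

```python
global_step, epochs, batch_size = 13360, 5, 64
train_runtime = 2782.8197  # seconds, from TrainOutput

steps_per_epoch = global_step // epochs

# If steps_per_epoch == ceil(num_examples / batch_size), the dataset size lies in this range.
min_examples = (steps_per_epoch - 1) * batch_size + 1
max_examples = steps_per_epoch * batch_size

steps_per_second = global_step / train_runtime

print(steps_per_epoch)             # 2672
print(min_examples, max_examples)  # 170945 171008
print(round(steps_per_second, 3))  # 4.801
```

The computed rate matches the reported `train_steps_per_second`, and the range is consistent with roughly 171,000 lines of kant.txt being used as training examples.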

#Step 14: Saving the final model (+tokenizer + config) to disk

In [29]:
trainer.save_model("./KantaiBERT")

#Step 15: Language modeling with FillMaskPipeline

We will now use a fill-mask pipeline for language modeling. It loads our trained model and
trained tokenizer to perform MLM.

In [33]:
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./KantaiBERT",
    tokenizer="./KantaiBERT"
)


In [34]:
fill_mask("Human thinking involves human <mask>.")

[{'score': 0.44560420513153076,
  'token': 393,
  'token_str': ' reason',
  'sequence': 'Human thinking involves human reason.'},
 {'score': 0.12695765495300293,
  'token': 587,
  'token_str': ' nature',
  'sequence': 'Human thinking involves human nature.'},
 {'score': 0.04443361982703209,
  'token': 987,
  'token_str': ' mind',
  'sequence': 'Human thinking involves human mind.'},
 {'score': 0.03229430317878723,
  'token': 485,
  'token_str': ' will',
  'sequence': 'Human thinking involves human will.'},
 {'score': 0.015074925497174263,
  'token': 723,
  'token_str': ' laws',
  'sequence': 'Human thinking involves human laws.'}]
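
The `score` values are probabilities for the masked position. Conceptually, the pipeline turns the model's output logits at `<mask>` into probabilities with a softmax and keeps the highest-scoring tokens. The sketch below shows that step with made-up logits over a six-token toy vocabulary; the numbers are hypothetical and do not come from KantaiBERT.

```python
import math

def top_k_softmax(logits, k):
    """Convert a {token: logit} mapping to probabilities and return the k best."""
    peak = max(logits.values())  # subtract the maximum for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in logits.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]

# Hypothetical logits for the masked position (toy vocabulary).
toy_logits = {" reason": 5.0, " nature": 3.7, " mind": 2.7, " will": 2.4, " laws": 1.6, " table": -1.0}

for token_str, score in top_k_softmax(toy_logits, k=5):
    print(f"{score:.3f} {token_str!r}")
```

In the real model the softmax runs over all 52,000 vocabulary entries, which is why the five scores shown above do not sum to 1.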