### Building KantaiBERT from scratch

In [1]:
#@ Loading the dataset
!curl -L https://raw.githubusercontent.com/Denis2054/Transformers-for-NLP-2nd-Edition/master/Chapter04/kant.txt --output "kant.txt"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 10.7M  100 10.7M    0     0  23.0M      0 --:--:-- --:--:-- --:--:-- 22.9M


In [2]:
#@ Installing Hugging Face Transformers
!pip install transformers                               # Installing transformers.
!pip list | grep -E "transformers|tokenizers"           # Inspecting versions.

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.15.1-py3-none-any.whl (236 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

**Step3: Training a tokenizer**

Hugging Face's `ByteLevelBPETokenizer()` will be trained on kant.txt.

A BPE tokenizer breaks a string or word down into substrings, or subwords.

The advantages of doing this are:
- The tokenizer can break words into minimal components and then merge these small components into statistically interesting ones.
- Chunks of text that a WordPiece-level encoding would classify as unknown (`unk_token`) practically disappear.
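The merge idea described above can be sketched in a few lines of plain Python. This is a toy illustration only, not the `tokenizers` implementation: it works on characters rather than bytes, and the function name `learn_bpe` and the word list are made up for the example.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Start with every word split into single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with every occurrence of that pair fused.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

# Hypothetical mini-corpus, not taken from kant.txt.
toy_words = ["reason", "reason", "reasons", "season"]
merges, corpus = learn_bpe(toy_words, num_merges=4)
print(merges)        # learned merge rules, most frequent first
print(list(corpus))  # how each word is segmented after the merges
```

Because the base alphabet always contains every single symbol, a word that was never seen can still be spelled out from smaller pieces, which is why unknown tokens become rare.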

In [4]:
#@ Training a Tokenizer
%%time
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

paths = [str(x) for x in Path(".").glob("**/*.txt")]
tokenizer = ByteLevelBPETokenizer()                         # Initialize a tokenizer
tokenizer.train(files=paths,                                # Customize the training
                vocab_size=52_000,
                min_frequency=2,
                special_tokens=[
                    "<s>",
                    "<pad>",
                    "</s>",
                    "<unk>",
                    "<mask>",
                ])

CPU times: user 7.69 s, sys: 342 ms, total: 8.03 s
Wall time: 1min 53s


**Step4: Saving the files to disk**

The tokenizer will generate two files when trained:
- merges.txt, which contains the merged tokenized substrings (one learned merge pair per line)
- vocab.json, which contains the indices of the tokenized substrings (a mapping from each token to its integer ID)
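To make the role of the two files concrete, here is a minimal sketch that reads files in this layout and uses them to encode a word. The file contents are a tiny hypothetical vocabulary written to a temporary directory, not the real KantaiBERT files, and `load_bpe_files` and `encode_word` are illustrative helpers rather than library functions.

```python
import json
import tempfile
from pathlib import Path

def load_bpe_files(directory):
    """Read vocab.json (token -> id) and merges.txt (one merge pair per line)."""
    directory = Path(directory)
    vocab = json.loads((directory / "vocab.json").read_text(encoding="utf-8"))
    lines = (directory / "merges.txt").read_text(encoding="utf-8").splitlines()
    # Skip the version header; every other non-empty line is "left right".
    merges = [tuple(line.split(" ")) for line in lines
              if line and not line.startswith("#version")]
    return vocab, merges

def encode_word(word, vocab, merges):
    """Apply merge rules in priority order, then look up the token IDs."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = list(zip(symbols, symbols[1:]))
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies any more
        i = pairs.index(best)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols, [vocab[s] for s in symbols]

# Hypothetical toy files with the same layout as the trained ones.
toy_vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3, "<mask>": 4,
             "a": 5, "k": 6, "n": 7, "t": 8, "an": 9, "ant": 10, "kant": 11}
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "vocab.json").write_text(json.dumps(toy_vocab), encoding="utf-8")
    Path(tmp, "merges.txt").write_text("#version: 0.2\na n\nan t\nk ant\n",
                                       encoding="utf-8")
    vocab, merges = load_bpe_files(tmp)

kant_tokens, kant_ids = encode_word("kant", vocab, merges)
tank_tokens, tank_ids = encode_word("tank", vocab, merges)
print(kant_tokens, kant_ids)
print(tank_tokens, tank_ids)
```

The first word collapses into a single token because all of its merges were learned, while the second falls back to smaller pieces.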

In [5]:
#@ Saving the files to disk
import os
token_dir = "KantaiBERT"                             # Relative to the working directory (/content on Colab)
os.makedirs(token_dir, exist_ok=True)                # Creating the directory if not available
tokenizer.save_model(token_dir)                      # Saving the tokenizer

['KantaiBERT/vocab.json', 'KantaiBERT/merges.txt']

**Step5: Loading the trained tokenizer files**

We could have loaded pretrained tokenizer files. However, we trained our own tokenizer and are now ready to load its files.

In [6]:
#@ Loading the Trained Tokenizer Files
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

tokenizer = ByteLevelBPETokenizer(
    "./KantaiBERT/vocab.json",
    "./KantaiBERT/merges.txt"
)

In [7]:
#@ Implementing the trained tokenizer
tokenizer.encode("The Critique of Pure Reason.").tokens

['The', 'ĠCritique', 'Ġof', 'ĠPure', 'ĠReason', '.']

In [8]:
#@ Checking the number of tokens in the sentence
tokenizer.encode("The Critique of Pure Reason.")

Encoding(num_tokens=6, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [9]:
#@ Adding the post-processor for the start and end tokens
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)

tokenizer.enable_truncation(max_length=512)

In [10]:
#@ Encoding the post-processed sequence
tokenizer.encode("The Critique of Pure Reason.")

Encoding(num_tokens=8, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [11]:
tokenizer.encode("The Critique of Pure Reason.").tokens

['<s>', 'The', 'ĠCritique', 'Ġof', 'ĠPure', 'ĠReason', '.', '</s>']
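The effect of the post-processor, 6 tokens becoming 8, can be mimicked with a small pure-Python sketch. The helper `add_special_tokens` is hypothetical, and it assumes that the truncation limit counts the two special tokens, which is our reading of how `enable_truncation` behaves rather than something shown in this notebook.

```python
def add_special_tokens(tokens, max_length=512, bos="<s>", eos="</s>"):
    """Wrap a token list as <s> ... </s>, truncating so the total fits max_length."""
    # Reserve two slots for the start and end tokens before truncating.
    body = tokens[: max_length - 2]
    return [bos] + body + [eos]

tokens = ["The", "ĠCritique", "Ġof", "ĠPure", "ĠReason", "."]
wrapped = add_special_tokens(tokens)
print(len(tokens), len(wrapped))
print(wrapped)

# A sequence that is too long is cut so that the wrapped result still fits.
long_wrapped = add_special_tokens(["x"] * 600)
print(len(long_wrapped))
```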

In [12]:
#@ Checking resource constraints: GPU and CUDA
!nvidia-smi

Sun Jun 25 11:56:37 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [13]:
#@ Checking that PyTorch Sees CUDA
import torch
torch.cuda.is_available()

True

**Step7: Defining the configuration of the model**

- Here, we will pretrain a RoBERTa-type transformer model using the same number of layers and heads as a DistilBERT transformer. The model will have a vocabulary size of 52,000, 12 attention heads, and 6 layers.
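Two quantities follow from this configuration and can be checked with simple arithmetic. The sketch below assumes a hidden size of 768 (the `RobertaConfig` default) and a padding index of 1 (the ID of `<pad>`, the second special token passed to the tokenizer). The helper `position_ids` is an illustrative re-implementation of RoBERTa-style position numbering, not the library function.

```python
hidden_size = 768              # assumed: RobertaConfig default
num_attention_heads = 12
max_position_embeddings = 514
padding_idx = 1                # assumed: ID of "<pad>" in our tokenizer

# Each head works on an equal slice of the hidden vector.
assert hidden_size % num_attention_heads == 0
head_dim = hidden_size // num_attention_heads

# Positions of real tokens start at padding_idx + 1, so two of the
# 514 embedding rows are never used for token positions.
max_tokens = max_position_embeddings - padding_idx - 1

def position_ids(input_ids, padding_idx=1):
    """Real tokens count up from padding_idx + 1; pad tokens stay at padding_idx."""
    ids, position = [], padding_idx
    for token_id in input_ids:
        if token_id == padding_idx:
            ids.append(padding_idx)
        else:
            position += 1
            ids.append(position)
    return ids

print(head_dim, max_tokens)
print(position_ids([0, 10, 11, 2, 1, 1]))
```

This is one way to see why `max_position_embeddings=514` pairs with a maximum sequence length of 512.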

In [14]:
#@ Defining the configuration of the model
from transformers import RobertaConfig
config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

In [15]:
#@ Step8: Reloading the tokenizer in Transformers
from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained("./KantaiBERT",
                                             max_length=512)

In [16]:
#@ Step9: Initializing a model from scratch
from transformers import RobertaForMaskedLM
model = RobertaForMaskedLM(config=config)                  # Initializing the Model
print(model)                                               # Inspecting the model

RobertaForMaskedLM(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(52000, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-5): 6 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): La

In [17]:
#@ EXPLORING THE PARAMETERS
print(model.num_parameters())                      # Inspecting number of parameters

83504416


In [18]:
LP = list(model.parameters())
lp = len(LP)
print(lp)

106


- The output shows that the model has 106 parameter tensors (weight matrices and bias vectors). This number varies from one transformer model to another.
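Both numbers reported above, 83,504,416 parameters and 106 tensors, can be reproduced by hand from the configuration. The sketch below is plain arithmetic under a few assumptions: an intermediate (feed-forward) size of 3072, which is the `RobertaConfig` default, no pooling layer, and an output decoder whose weight matrix is shared with the word embeddings so that only its bias adds parameters.

```python
vocab_size, hidden, layers = 52_000, 768, 6
max_positions, type_vocab = 514, 1
intermediate = 3072  # assumed: RobertaConfig default intermediate_size

def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # scale and shift vectors

# Word, position and token-type embeddings, followed by a LayerNorm.
embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

encoder_layer = (
    4 * linear(hidden, hidden)        # query, key, value, attention output
    + layer_norm
    + linear(hidden, intermediate)    # feed-forward expansion
    + linear(intermediate, hidden)    # feed-forward projection
    + layer_norm
)

# LM head: dense + LayerNorm + output bias (decoder weight assumed tied).
lm_head = linear(hidden, hidden) + layer_norm + vocab_size

total = embeddings + layers * encoder_layer + lm_head

# Tensor count: 5 in the embeddings, 16 per encoder layer, 5 in the LM head.
tensors = 5 + layers * 16 + 5

print(total)    # 83504416
print(tensors)  # 106
```

Nearly half of the parameters sit in the 52,000 x 768 word embedding matrix, which is typical for a small model with a large vocabulary.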

In [19]:
#@ Displaying all the parameters
for param in LP:
  print(param)

Parameter containing:
tensor([[ 0.0090, -0.0031,  0.0173,  ..., -0.0076,  0.0087, -0.0163],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0129,  0.0097, -0.0057,  ...,  0.0116,  0.0294,  0.0126],
        ...,
        [-0.0049, -0.0026,  0.0226,  ...,  0.0051, -0.0099, -0.0153],
        [ 0.0299, -0.0015,  0.0414,  ...,  0.0140,  0.0048, -0.0027],
        [-0.0012,  0.0132,  0.0055,  ..., -0.0130, -0.0124, -0.0124]],
       requires_grad=True)
Parameter containing:
tensor([[ 0.0167,  0.0077, -0.0266,  ..., -0.0538, -0.0362, -0.0148],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0260, -0.0155,  0.0198,  ...,  0.0129, -0.0215, -0.0062],
        ...,
        [ 0.0275, -0.0179,  0.0065,  ...,  0.0248,  0.0312,  0.0167],
        [-0.0019,  0.0257, -0.0100,  ...,  0.0137,  0.0237,  0.0206],
        [-0.0123, -0.0193,  0.0329,  ..., -0.0216,  0.0339, -0.0252]],
       requires_grad=True)
Parameter containing:
tensor([[-2.

In [20]:
#@ Building the dataset
%%time
from transformers import LineByLineTextDataset
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./kant.txt",
    block_size=128
)



CPU times: user 29.6 s, sys: 407 ms, total: 30 s
Wall time: 39 s


**Step11: Defining a data collator**

- We need to run a data collator before initializing the trainer. A data collator takes samples from the dataset and collates them into batches.

- We also set the proportion of tokens to mask with `mlm_probability=0.15`. This determines the percentage of tokens masked during the pretraining process.

- We now initialize `data_collator` with our tokenizer, MLM activated, and the proportion of masked tokens set to 0.15.
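What the collator does to a batch can be sketched in plain Python. This is a simplified illustration, not the `DataCollatorForLanguageModeling` source: it pads samples to a common length, selects roughly 15% of the non-special tokens, and applies the commonly documented 80/10/10 split (mask token, random token, unchanged). The function names and the sample token IDs are made up for the example.

```python
import random

def mask_tokens(input_ids, special_ids, mask_id, vocab_size,
                mlm_probability=0.15, rng=random):
    """Return (masked inputs, labels); labels are -100 where no loss is computed."""
    inputs, labels = list(input_ids), [-100] * len(input_ids)
    for i, token_id in enumerate(input_ids):
        # Never mask special tokens; select other tokens with the given probability.
        if token_id in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = token_id                       # the model must predict this token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels

def collate(samples, pad_id, **mask_kwargs):
    """Pad samples to the longest one, then mask each padded row."""
    longest = max(len(s) for s in samples)
    padded = [s + [pad_id] * (longest - len(s)) for s in samples]
    masked = [mask_tokens(row, **mask_kwargs) for row in padded]
    return [m[0] for m in masked], [m[1] for m in masked]

# Hypothetical token IDs; 0, 1 and 2 stand for <s>, <pad> and </s>, 4 for <mask>.
batch = [[0, 393, 601, 2], [0, 613, 586, 671, 393, 2]]
inputs, labels = collate(batch, pad_id=1, special_ids={0, 1, 2}, mask_id=4,
                         vocab_size=52_000, rng=random.Random(0))
print(inputs)
print(labels)
```

The `-100` label is the value that PyTorch's cross-entropy loss ignores by default, so only the selected positions contribute to the training loss.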

In [21]:
#@ Defining a data collator
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                mlm=True,
                                                mlm_probability=0.15
                                                )

**Step12: Initializing the trainer**

- We have prepared the information required to initialize the trainer. The dataset has been tokenized and loaded. Our model is built. The data collator has been created.

- Now, we can initialize the trainer.

In [22]:
#@ Initializing the trainer
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./KantaiBERT",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

In [23]:
#@ Step13: Pretraining the model
%%time
trainer.train()



| Step | Training Loss |
|------|---------------|
| 500  | 6.5957        |
| 1000 | 5.6933        |
| 1500 | 5.2167        |
| 2000 | 4.9686        |
| 2500 | 4.8234        |


CPU times: user 9min 17s, sys: 2.21 s, total: 9min 19s
Wall time: 9min 31s


TrainOutput(global_step=2672, training_loss=5.41564306384789, metrics={'train_runtime': 571.5281, 'train_samples_per_second': 299.135, 'train_steps_per_second': 4.675, 'total_flos': 873620128952064.0, 'train_loss': 5.41564306384789, 'epoch': 1.0})
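The figures in `TrainOutput` are consistent with each other, which can be checked with a little arithmetic. The sketch below estimates the number of training samples from the reported throughput and runtime, so that count is an approximation rather than a value printed by the notebook. It also assumes a single device and no gradient accumulation.

```python
import math

# Values copied from the TrainOutput above and from TrainingArguments.
train_runtime = 571.5281        # seconds
samples_per_second = 299.135
batch_size = 64                 # per_device_train_batch_size
save_steps = 10_000

# Estimated dataset size (lines of kant.txt turned into samples).
estimated_samples = round(samples_per_second * train_runtime)

# One optimizer step per batch; the last batch may be smaller.
steps_per_epoch = math.ceil(estimated_samples / batch_size)

# Checkpoints written at multiples of save_steps during the epoch.
checkpoints_during_training = steps_per_epoch // save_steps

print(estimated_samples)
print(steps_per_epoch)              # matches global_step=2672
print(checkpoints_during_training)  # 0: the run ends before step 10,000
```

Since no intermediate checkpoint is reached in a single epoch, the explicit save in the next step is what writes the trained weights to disk.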

**Step14: Saving the final model (+tokenizer + config) to disk**

In [24]:
#@ Saving the Final Model to disk
trainer.save_model("./KantaiBERT")

**Step15: Language Modeling with FillMaskPipeline**

- We will now import a language modeling fill-mask task. We will use our trained model and trained tokenizer to perform MLM.

In [25]:
#@ Language Modeling with FillMaskPipeline
from transformers import pipeline
fill_mask = pipeline(
    "fill-mask",
    model="./KantaiBERT",
    tokenizer="./KantaiBERT"
)

In [26]:
fill_mask("Human Thinking involves human <mask>.")

[{'score': 0.020256564021110535,
  'token': 393,
  'token_str': ' reason',
  'sequence': 'Human Thinking involves human reason.'},
 {'score': 0.012400107458233833,
  'token': 601,
  'token_str': ' understanding',
  'sequence': 'Human Thinking involves human understanding.'},
 {'score': 0.008811179548501968,
  'token': 613,
  'token_str': ' principle',
  'sequence': 'Human Thinking involves human principle.'},
 {'score': 0.007971439510583878,
  'token': 586,
  'token_str': ' nature',
  'sequence': 'Human Thinking involves human nature.'},
 {'score': 0.007662993390113115,
  'token': 671,
  'token_str': ' principles',
  'sequence': 'Human Thinking involves human principles.'}]
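The `score` values above are probabilities over the whole vocabulary for the masked position. A minimal sketch of that last step, a softmax followed by a top-k selection, is shown below. The logits and the six-token vocabulary are invented for illustration, and `top_k_fill` is a hypothetical helper, not part of the pipeline API.

```python
import math

def top_k_fill(logits, id_to_token, k=5):
    """Turn raw scores for the masked position into the k most likely tokens."""
    peak = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    return [{"score": probs[i], "token": i, "token_str": id_to_token[i]}
            for i in ranked]

# Made-up vocabulary and logits; the token IDs here are just list indices.
id_to_token = [" reason", " understanding", " principle",
               " nature", " principles", " table"]
logits = [2.0, 1.5, 1.2, 1.1, 1.0, -1.0]

results = top_k_fill(logits, id_to_token, k=5)
for candidate in results:
    print(round(candidate["score"], 4), candidate["token_str"])
```

With a real vocabulary of 52,000 tokens the probability mass is spread much more thinly, which may explain why even the best candidate above scores only about 2% after a single epoch of training.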

**The End**