<a href="https://colab.research.google.com/github/rahiakela/transformers-for-natural-language-processing/blob/main/3-pretraining-RoBERTa-model-from-scratch/building_KantaiBERT_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Building KantaiBERT from scratch using Transformers and Tokenizers

This notebook builds a Transformer model named ***KantaiBERT***. ***KantaiBERT*** is trained as a RoBERTa-type Transformer with the same number of layers and heads as DistilBERT. The dataset was compiled from three books by Immanuel Kant downloaded from [Project Gutenberg](https://www.gutenberg.org/).

<center><img src="https://eco-ai-horizons.com/data/Kant.jpg" style="margin: auto; display: block; width: 260px;"></center>

![](https://commons.wikimedia.org/wiki/Kant_gemaelde_1.jpg)

***KantaiBERT*** is pretrained as a small model of about 84 million parameters using the same number of layers and heads as DistilBERT, i.e., 6 layers, a hidden size of 768, and 12 attention heads. ***KantaiBERT*** is then applied to a downstream masked language modeling task.



## Setup

Notebook edition ([link to the original reference blog post](https://huggingface.co/blog/how-to-train)).

We will need to install Hugging Face transformers and tokenizers.

In [None]:
# We won't need TensorFlow here
!pip uninstall -y tensorflow
# Install `transformers` from master
!pip install git+https://github.com/huggingface/transformers
!pip list | grep -E 'transformers|tokenizers'
# transformers version at notebook update --- 2.9.1
# tokenizers version at notebook update --- 0.7.0

In [2]:
import os
from pathlib import Path

import torch

from tokenizers import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

from transformers import RobertaConfig, RobertaTokenizer, RobertaForMaskedLM
from transformers import LineByLineTextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments, pipeline

## Step 1: Loading the Dataset

I chose to use the works of Immanuel Kant (1724-1804), the German philosopher, who was the epitome of the Age of Enlightenment. The idea is to introduce human-like logic and pretrained reasoning for downstream reasoning tasks.

I compiled the following three books by Immanuel Kant into a text file named `kant.txt`:

- The Critique of Pure Reason
- The Critique of Practical Reason
- Fundamental Principles of the Metaphysic of Morals

`kant.txt` provides a small dataset to train the transformer model. The results obtained remain experimental. For a real-life project, I would
add the complete works of Immanuel Kant, René Descartes, Pascal, and Leibniz, for example.

The text file contains the raw text of the books:

```
…For it is in reality vain to profess _indifference_ in regard to such
inquiries, the object of which cannot be indifferent to humanity.
```

In [3]:
# Option 1: load kant.txt using the Colab file manager
# Option 2: download the file from GitHub
!curl -L https://github.com/rahiakela/transformers-for-natural-language-processing/raw/main/3-pretraining-RoBERTa-model-from-scratch/kant.txt --output "kant.txt"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   209  100   209    0     0    487      0 --:--:-- --:--:-- --:--:--   487
100 10.7M  100 10.7M    0     0  6832k      0  0:00:01  0:00:01 --:--:-- 32.3M


## Step 3: Training a Tokenizer

We do not use a pretrained tokenizer here. A pretrained GPT-2 tokenizer could be used, for example. However, the training process in this notebook includes training a tokenizer from scratch.

Hugging Face's `ByteLevelBPETokenizer()` will be trained using `kant.txt`. A byte-level tokenizer will break a string or word down into sub-strings or sub-words.

There are two main advantages among many others:

- The tokenizer can break words into minimal components. Then it will merge
these small components into statistically interesting ones. For example,
"smaller" and "smallest" can become "small," "er," and "est." The tokenizer
can go further, and we could obtain "sm" and "all," for example. In any case,
the words are broken down into sub-word tokens and smaller units of sub-word
parts such as "sm" and "all" instead of simply "small."

- The chunks of strings classified as unknown (`unk_token`) under WordPiece-level
encoding will practically disappear.
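
The merging idea in the first point can be sketched with a toy, character-level BPE trainer. This is only an illustration under simplifying assumptions: it works on characters rather than bytes, the helper names (`count_pairs`, `merge_pair`, `train_toy_bpe`) and the word frequencies are made up for the example, and it is not how `ByteLevelBPETokenizer` is implemented internally.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_toy_bpe(word_freqs, num_merges):
    """Learn `num_merges` merges, always picking the most frequent pair."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # Ties are broken by the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


# Hypothetical word frequencies, chosen only for this example.
merges, words = train_toy_bpe({"small": 5, "smaller": 3, "smallest": 2}, num_merges=4)
print(merges)
print(words)
```

After four merges, "small" has become a single symbol, while the rarer endings of "smaller" and "smallest" are still split into characters.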

In [4]:
%%time 

paths = [str(x) for x in Path(".").glob("**/*.txt")]
# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

CPU times: user 5.7 s, sys: 401 ms, total: 6.1 s
Wall time: 1.64 s


The tokenizer will be trained to generate merged sub-string tokens and analyze their frequency.

Let's take these two words in the middle of a sentence:

```
…the tokenizer…
```

The first step will be to tokenize the string:

```
'Ġthe', 'Ġtoken', 'izer',
```

The string is now tokenized into tokens with Ġ (whitespace) information.

The next step is to replace them with their indices:

| 'Ġthe' | 'Ġtoken' | 'izer' |
| ---    | ------   | ----   |
| 150    | 5430     | 4712   |
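
The lookup in the table can be sketched in a few lines of plain Python. The three-entry vocabulary below is copied from the table and is only a stand-in for the real `vocab.json`; the `encode` and `decode` helpers are hypothetical names for this illustration, not the `tokenizers` API.

```python
# Toy vocabulary taken from the table above; the real vocabulary is much larger.
vocab = {"Ġthe": 150, "Ġtoken": 5430, "izer": 4712}
id_to_token = {index: token for token, index in vocab.items()}


def encode(tokens):
    """Replace each token with its index."""
    return [vocab[token] for token in tokens]


def decode(ids):
    """Map indices back to tokens, join them, and restore the spaces marked by Ġ."""
    return "".join(id_to_token[i] for i in ids).replace("Ġ", " ")


ids = encode(["Ġthe", "Ġtoken", "izer"])
print(ids)                # [150, 5430, 4712]
print(repr(decode(ids)))  # ' the tokenizer'
```

Because the whitespace is stored inside the tokens, decoding needs no extra rule about where to put spaces.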



## Step 4: Saving the files to disk

The tokenizer will generate two files when trained:

- `merges.txt`, which contains the merged tokenized sub-strings
- `vocab.json`, which contains the indices of the tokenized sub-strings

In [5]:
token_dir = 'KantaiBERT'
# Create the output directory if needed, then save both files into it
os.makedirs(token_dir, exist_ok=True)
tokenizer.save_model(token_dir)

['KantaiBERT/vocab.json', 'KantaiBERT/merges.txt']
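
To look inside the two files, a small reader such as the one below may help. It is a sketch, assuming `vocab.json` is a JSON object mapping tokens to indices and `merges.txt` holds one space-separated pair per line after a `#version` header line; `load_tokenizer_files` is a hypothetical helper, not part of `tokenizers`.

```python
import json
from pathlib import Path


def load_tokenizer_files(directory):
    """Read vocab.json (token -> index) and merges.txt (one merge pair per line)."""
    directory = Path(directory)
    with open(directory / "vocab.json", encoding="utf-8") as f:
        vocab = json.load(f)
    with open(directory / "merges.txt", encoding="utf-8") as f:
        lines = f.read().split("\n")
    # Assumption: the first line is a header such as "#version: 0.2".
    if lines and lines[0].startswith("#version"):
        lines = lines[1:]
    merges = [tuple(line.split(" ")) for line in lines if line]
    return vocab, merges


# Only runs if the tokenizer files from this step are present.
if Path("KantaiBERT/vocab.json").exists() and Path("KantaiBERT/merges.txt").exists():
    vocab, merges = load_tokenizer_files("KantaiBERT")
    print(len(vocab), "tokens,", len(merges), "merges")
    print(merges[:5])
```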

## Step 5: Loading the Trained Tokenizer Files

We could have loaded pretrained tokenizer files. However, we trained our own
tokenizer and now are ready to load the files:

In [6]:
tokenizer = ByteLevelBPETokenizer(
    "./KantaiBERT/vocab.json",
    "./KantaiBERT/merges.txt"
)

The tokenizer can encode a sequence:

In [7]:
tokenizer.encode("The Critique of Pure Reason.").tokens

['The', 'ĠCritique', 'Ġof', 'ĠPure', 'ĠReason', '.']

We can also ask to see the number of tokens in this sequence:

In [8]:
tokenizer.encode("The Critique of Pure Reason.")

Encoding(num_tokens=6, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

The tokenizer now processes the tokens to fit the BERT model variant used. The post processor will add a start and end token, for example:

In [9]:
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)

tokenizer.enable_truncation(max_length=512)

Let's encode a post-processed sequence:

In [10]:
tokenizer.encode("The Critique of Pure Reason.")

Encoding(num_tokens=8, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

If we want to see what was added, we can display the tokens of the post-processed sequence.

In [11]:
tokenizer.encode("The Critique of Pure Reason.").tokens

['<s>', 'The', 'ĠCritique', 'Ġof', 'ĠPure', 'ĠReason', '.', '</s>']

The output shows that the start and end tokens have been added, which brings the number of tokens to 8 including start and end tokens.
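
What the post-processor and truncation do together can be mimicked in plain Python. This is a minimal sketch, assuming the truncation budget of 512 includes the two special tokens; `add_special_tokens` is a hypothetical helper, not a `tokenizers` function.

```python
def add_special_tokens(tokens, start="<s>", end="</s>", max_length=512):
    """Wrap a token list in start/end tokens, truncating so the total fits max_length."""
    room = max_length - 2  # two slots are reserved for the start and end tokens
    return [start] + tokens[:room] + [end]


tokens = ['The', 'ĠCritique', 'Ġof', 'ĠPure', 'ĠReason', '.']
processed = add_special_tokens(tokens)
print(len(processed), processed)
```

For the six tokens of the example sentence, this reproduces the eight-token sequence shown above.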

## Step 6: Checking Resource Constraints: GPU and NVIDIA 

KantaiBERT runs at optimal speed with a Graphics Processing Unit (GPU).

We will first run a command to see if an NVIDIA GPU card is present:

In [12]:
# Checking Resource Constraints: GPU and NVIDIA 
!nvidia-smi

Fri Apr  2 06:41:28 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.67       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

We will now check to make sure PyTorch sees CUDA:

In [13]:
torch.cuda.is_available()

True

Compute Unified Device Architecture (CUDA) was developed by NVIDIA to use
the parallel computing power of its GPUs.

## Step 7: Defining the configuration of the Model

We will be pretraining a RoBERTa-type transformer model using the same number
of layers and heads as a DistilBERT transformer. The model will have a vocabulary size set to 52,000, 12 attention heads, and 6 layers.

In [14]:
# Defining the configuration of the Model
config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

In [15]:
print(config)

RobertaConfig {
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.5.0.dev0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}
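
A few values in the printed configuration are related by simple arithmetic, which the sketch below spells out. The variable names are mine; the remark about position ids reflects my understanding that RoBERTa-style models number positions starting after `pad_token_id`, which is why 514 rows are requested for sequences of up to 512 tokens.

```python
hidden_size = 768
num_attention_heads = 12
intermediate_size = 3072
max_position_embeddings = 514
padding_idx = 1  # pad_token_id in the printed configuration

# Each attention head works on an equal slice of the hidden vector.
assert hidden_size % num_attention_heads == 0
head_dim = hidden_size // num_attention_heads

# The feed-forward layer expands the hidden vector by this factor.
ffn_expansion = intermediate_size // hidden_size

# Assumption: position ids start at padding_idx + 1, so the first
# padding_idx + 1 rows of the position table are not used for real tokens.
usable_positions = max_position_embeddings - (padding_idx + 1)

print(head_dim, ffn_expansion, usable_positions)  # 64 4 512
```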



## Step 8: Re-creating the Tokenizer in Transformers

We are now ready to load our trained tokenizer, which serves as our pretrained tokenizer, with `RobertaTokenizer.from_pretrained()`.

In [16]:
# Re-creating the Tokenizer in Transformers
tokenizer = RobertaTokenizer.from_pretrained("./KantaiBERT", max_length=512)

## Step 9: Initializing a Model From Scratch

We will initialize a model from scratch and examine the size of the
model.



In [17]:
# Initializing a Model From Scratch
model = RobertaForMaskedLM(config=config)
# If we print the model, we can see that it is a BERT model with 6 layers and 12 heads
print(model)

RobertaForMaskedLM(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(52000, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNor

The model is small and contains 83,504,416 parameters.

In [18]:
print(model.num_parameters())

83504416


**Exploring the Parameters**

Let's now look into the parameters. We first store the parameters in LP and calculate the length of the list of parameters.

In [None]:
# Exploring the Parameters
LP=list(model.parameters())
lp=len(LP)
print(lp)
for p in range(0,lp):
  print(LP[p])

The output shows that there are approximately 106 matrices and vectors, which
might vary from one transformer model to another.

**Counting the parameters**

The number of parameters is calculated by taking all parameters in the model and adding them up; for example:

- The vocabulary (52,000) x dimensions (768)
- The size of many vectors is 1 x 768
- The many other dimensions found

You will note that $d_{model} = 768$. There are 12 heads in the model. The dimension of $d_k$ for each head will thus be $d_k = \frac{d_{model}}{12} = 64$. This shows, once again, the optimized
Lego concept of the building blocks of a transformer.
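
The total can also be reproduced by hand from the configuration. The sketch below is my own reading of the printed model rather than something the notebook spells out: it assumes every linear layer and LayerNorm has a weight and a bias, and that the output projection of the masked-LM head shares its weight with the word embeddings, so that matrix is counted once.

```python
def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


def roberta_mlm_parameter_count(vocab_size=52_000, hidden=768, layers=6,
                                intermediate=3072, max_positions=514,
                                type_vocab_size=1):
    """Closed-form parameter count for a RoBERTa-style masked LM (tied weights)."""
    layer_norm = 2 * hidden  # weight + bias

    embeddings = (vocab_size + max_positions + type_vocab_size) * hidden + layer_norm
    attention = 4 * linear(hidden, hidden) + layer_norm  # query, key, value, output
    feed_forward = (linear(hidden, intermediate)
                    + linear(intermediate, hidden)
                    + layer_norm)
    encoder = layers * (attention + feed_forward)
    # Head: dense layer, LayerNorm, and one bias per vocabulary entry.
    # Assumption: the output projection reuses the word embedding matrix.
    lm_head = linear(hidden, hidden) + layer_norm + vocab_size
    return embeddings + encoder + lm_head


def roberta_mlm_tensor_count(layers=6):
    """Number of weight and bias tensors under the same assumptions."""
    embeddings, per_layer, lm_head = 5, 16, 5
    return embeddings + per_layer * layers + lm_head


print(roberta_mlm_parameter_count())  # 83504416
print(roberta_mlm_tensor_count())     # 106
```

Both numbers agree with what the model reports, which suggests the assumptions match this configuration.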

We will take this further and count the number of parameters of each tensor.

First, the program initializes a parameter counter named `np` (number of parameters) and goes through the `lp` (106) elements in the list of parameters.

The parameters are matrices and vectors of different sizes, for example:
- 768 x 768
- 768 x 1
- 768

We can see that some parameters are two-dimensional, and some are one-dimensional.

In [20]:
# Counting the parameters
np=0
for p in range(0, lp):#number of tensors
  # An easy way to find out is to try and see if a parameter p in the list LP[p] has two dimensions or not
  PL2=True
  try:
    L2=len(LP[p][0]) #check if 2D
  except TypeError:  # len() fails on a 0-d element, so the tensor is 1D
    L2=1             #not 2D but 1D
    PL2=False
  # L1 is the size of the first dimension of the parameter. L3 is the number of values in the tensor, L1*L2
  L1=len(LP[p])      
  L3=L1*L2
  # We can now add the parameters up at each step of the loop
  np+=L3             # number of parameters per tensor

  """
  We will obtain the sum of the parameters, but we also want to see exactly how the
  number of parameters of a transformer model is calculated
  """
  if PL2==True:
    print(p,L1,L2,L3)  # displaying the sizes of the parameters
  if PL2==False:
    print(p,L1,L3)  # displaying the sizes of the parameters

print(np)              # total number of parameters

0 52000 768 39936000
1 514 768 394752
2 1 768 768
3 768 768
4 768 768
5 768 768 589824
6 768 768
7 768 768 589824
8 768 768
9 768 768 589824
10 768 768
11 768 768 589824
12 768 768
13 768 768
14 768 768
15 3072 768 2359296
16 3072 3072
17 768 3072 2359296
18 768 768
19 768 768
20 768 768
21 768 768 589824
22 768 768
23 768 768 589824
24 768 768
25 768 768 589824
26 768 768
27 768 768 589824
28 768 768
29 768 768
30 768 768
31 3072 768 2359296
32 3072 3072
33 768 3072 2359296
34 768 768
35 768 768
36 768 768
37 768 768 589824
38 768 768
39 768 768 589824
40 768 768
41 768 768 589824
42 768 768
43 768 768 589824
44 768 768
45 768 768
46 768 768
47 3072 768 2359296
48 3072 3072
49 768 3072 2359296
50 768 768
51 768 768
52 768 768
53 768 768 589824
54 768 768
55 768 768 589824
56 768 768
57 768 768 589824
58 768 768
59 768 768 589824
60 768 768
61 768 768
62 768 768
63 3072 768 2359296
64 3072 3072
65 768 3072 2359296
66 768 768
67 768 768
68 768 768
69 768 768 589824
70 768 768
71 768 768

Note that if a parameter only has one dimension, `PL2=False`, then we only display the first dimension.

The total number of parameters of the RoBERTa model is displayed at the end of
the list: `83,504,416`
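
The same count can be written without the `try`/`except` test by multiplying the sizes of all dimensions of each tensor. The sketch below uses only the shapes of the first five tensors, read off the listing above, so it runs without the model. With the actual PyTorch parameters, `p.numel()` returns this product directly.

```python
import math

# Shapes of tensors 0 to 4 in the listing above; 1D tensors have a single size.
shapes = [(52000, 768), (514, 768), (1, 768), (768,), (768,)]


def count_values(shape):
    """Number of values in a tensor: the product of its dimension sizes."""
    return math.prod(shape)


sizes = [count_values(shape) for shape in shapes]
print(sizes)
print(sum(sizes))  # parameters of the embedding block only
```

This works for any number of dimensions, whereas the loop above only distinguishes one-dimensional and two-dimensional tensors.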

## Step 10: Building the Dataset

We will now load the dataset line by line for batch training, with
`block_size=128` limiting the length of an example.

In [21]:
%%time
# Step 10: Building the Dataset

dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path="./kant.txt", block_size=128)



CPU times: user 22 s, sys: 369 ms, total: 22.4 s
Wall time: 22.3 s


The output shows that the whole file is tokenized in about 22 seconds. Hugging Face has invested
a considerable amount of resources into optimizing the time it takes to process data.
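
Loading line by line with a block size can be pictured as follows. This is a simplified sketch, assuming that blank lines are skipped and that each remaining line becomes one example truncated to `block_size` tokens; whitespace splitting stands in for the real tokenizer, and `line_by_line_examples` is a hypothetical helper.

```python
def line_by_line_examples(text, tokenize, block_size):
    """Turn each non-blank line into one example of at most block_size tokens."""
    lines = [line for line in text.splitlines() if line and not line.isspace()]
    return [tokenize(line)[:block_size] for line in lines]


sample = "a b c d\n\n   \ne f\n"
examples = line_by_line_examples(sample, tokenize=str.split, block_size=3)
print(examples)  # [['a', 'b', 'c'], ['e', 'f']]
```

Lines longer than the block size lose their tail, so a small `block_size` trades context for speed and memory.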

We will now define a data collator to assemble the batches used for training and backpropagation.

## Step 11: Defining a Data Collator

We need to run a data collator before initializing the trainer. A data collator will take samples from the dataset and collate them into batches. The results are dictionary-like objects.

We are preparing a batched sample process for **Masked Language Modeling (MLM)** by setting `mlm=True`.

We also set the proportion of tokens to mask with `mlm_probability=0.15`. This determines the percentage of tokens masked during the pretraining process.


We now initialize `data_collator` with our tokenizer, MLM activated, and the
proportion of masked tokens set to 0.15:

In [22]:
# Defining a Data Collator
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
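
To make the masking step concrete, here is a plain-Python sketch of the usual BERT-style recipe. It assumes that selected tokens are replaced by `<mask>` 80% of the time, by a random token 10% of the time, and left unchanged otherwise, with label `-100` marking positions that do not contribute to the loss. `mask_tokens` is a hypothetical helper written for this illustration, not the collator's own code, and the `<mask>` index of 4 follows from the order of the special tokens given when training the tokenizer.

```python
import random


def mask_tokens(ids, mask_id, vocab_size, special_ids, mlm_probability=0.15, rng=None):
    """Return (inputs, labels) for masked language modeling on one sequence."""
    rng = rng or random.Random()
    inputs = list(ids)
    labels = [-100] * len(ids)  # -100 means "ignore this position in the loss"
    for i, token in enumerate(ids):
        # Never mask special tokens; select other tokens with mlm_probability.
        if token in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = token  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # replace with a random token
        # otherwise keep the original token
    return inputs, labels


# Hypothetical token ids wrapped in <s> (0) and </s> (2).
ids = [0] + list(range(5, 205)) + [2]
inputs, labels = mask_tokens(ids, mask_id=4, vocab_size=52_000,
                             special_ids={0, 1, 2}, rng=random.Random(0))
selected = sum(label != -100 for label in labels)
print(selected, "of", len(ids), "positions selected for prediction")
```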

## Step 12: Initializing the Trainer

We can now initialize the trainer. For educational purposes, the program
trains the model quickly: the number of epochs is limited to one. The GPU comes in handy since we can share the batches and multi-process the training tasks.

In [23]:
# Initializing the Trainer
training_args = TrainingArguments(output_dir="./KantaiBERT", 
                                  overwrite_output_dir=True, 
                                  num_train_epochs=1, 
                                  per_device_train_batch_size=64,
                                  save_steps=10_000, 
                                  save_total_limit=2)

trainer = Trainer(model=model, args=training_args, data_collator=data_collator, train_dataset=dataset)

The model is now ready for training.

## Step 13: Pre-training the Model

Everything is ready. The trainer is launched with one line of code:

In [24]:
%%time
# Pre-training the Model

trainer.train()

Step,Training Loss
500,5.4742
1000,4.0257
1500,3.7324
2000,3.5148
2500,3.402


CPU times: user 8min 35s, sys: 4min 27s, total: 13min 2s
Wall time: 7min 7s


TrainOutput(global_step=2672, training_loss=3.9887362268870463, metrics={'train_runtime': 426.5652, 'train_samples_per_second': 6.264, 'total_flos': 1689347110470912.0, 'epoch': 1.0, 'init_mem_cpu_alloc_delta': 2077216768, 'init_mem_gpu_alloc_delta': 334180352, 'init_mem_cpu_peaked_delta': 0, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 15331328, 'train_mem_gpu_alloc_delta': 1010226176, 'train_mem_cpu_peaked_delta': 0, 'train_mem_gpu_peaked_delta': 2583970816})

The output displays the training progress in real time, showing the training loss at regular step intervals, followed by a summary of the run.
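
The `global_step` value in `TrainOutput` can be related to the size of the dataset. The sketch below assumes one device, no gradient accumulation, and that the last, possibly smaller, batch is kept; under those assumptions, one epoch takes the number of examples divided by the batch size, rounded up. The helper names are mine.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Optimizer steps needed to see every example once."""
    return math.ceil(num_examples / batch_size)


def examples_for_steps(steps, batch_size):
    """Smallest and largest dataset sizes that give exactly `steps` steps."""
    return (steps - 1) * batch_size + 1, steps * batch_size


low, high = examples_for_steps(steps=2672, batch_size=64)
print(low, high)  # 170945 171008
```

With 2,672 steps at a batch size of 64, the dataset built from `kant.txt` would hold roughly 171,000 examples.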

## Step 14: Saving the Final Model(+tokenizer + config) to disk

The model has been trained. It's time to save our work.

We will now save the model and configuration:

In [25]:
# Saving the Final Model(+tokenizer + config) to disk
trainer.save_model("./KantaiBERT")

`config.json`, `pytorch_model.bin`, and `training_args.bin` should now appear in the file manager.

`merges.txt` and `vocab.json` contain the pretrained tokenization of the dataset.

## Step 15: Language Modeling with the FillMaskPipeline

We have built a model from scratch.

We will now import a fill-mask pipeline and use our trained model and
trained tokenizer to perform masked language modeling:

In [26]:
# Language Modeling with the FillMaskPipeline
fill_mask = pipeline("fill-mask", model="./KantaiBERT", tokenizer="./KantaiBERT")

We can now ask our model to think like Immanuel Kant:

In [28]:
fill_mask("Human thinking involves human <mask>.")

[{'score': 0.040765952318906784,
  'sequence': 'Human thinking involves human reason.',
  'token': 393,
  'token_str': ' reason'},
 {'score': 0.017752638086676598,
  'sequence': 'Human thinking involves human conceptions.',
  'token': 605,
  'token_str': ' conceptions'},
 {'score': 0.01720724254846573,
  'sequence': 'Human thinking involves human understanding.',
  'token': 600,
  'token_str': ' understanding'},
 {'score': 0.013491691090166569,
  'sequence': 'Human thinking involves human experience.',
  'token': 531,
  'token_str': ' experience'},
 {'score': 0.012612631544470787,
  'sequence': 'Human thinking involves human time.',
  'token': 526,
  'token_str': ' time'}]

The output will likely change after each run because we are pretraining the model from scratch with a limited amount of data. However, the output obtained in this run is interesting because it introduces conceptual language modeling.
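
The `score` values in the output read as probabilities. As far as I understand the pipeline, they come from a softmax over the vocabulary at the `<mask>` position, followed by a selection of the highest-ranked candidates. The sketch below shows that computation on made-up logits; `top_k_softmax` is a hypothetical helper, not the pipeline's own code.

```python
import math


def top_k_softmax(logits, k=5):
    """Convert logits to probabilities and return the k most likely (index, prob) pairs."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


# Made-up logits for a four-token vocabulary.
for index, prob in top_k_softmax([0.0, 2.0, 1.0, -1.0], k=2):
    print(index, round(prob, 4))
```

Low scores such as 0.04 for the best candidate mean the probability mass is spread over many tokens, which is what we would expect from a small model trained for one epoch.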

In [29]:
fill_mask("How human show love <mask>.")

[{'score': 0.013027912005782127,
  'sequence': 'How human show love reason.',
  'token': 393,
  'token_str': ' reason'},
 {'score': 0.012774677015841007,
  'sequence': 'How human show love is.',
  'token': 300,
  'token_str': ' is'},
 {'score': 0.012741762213408947,
  'sequence': 'How human show love,.',
  'token': 16,
  'token_str': ','},
 {'score': 0.010261932387948036,
  'sequence': 'How human show love conceptions.',
  'token': 605,
  'token_str': ' conceptions'},
 {'score': 0.008668892085552216,
  'sequence': 'How human show love itself.',
  'token': 500,
  'token_str': ' itself'}]

In [30]:
fill_mask("What is wealth of nations <mask>.")

[{'score': 0.02068767324090004,
  'sequence': 'What is wealth of nations reason.',
  'token': 393,
  'token_str': ' reason'},
 {'score': 0.018951253965497017,
  'sequence': 'What is wealth of nations conceptions.',
  'token': 605,
  'token_str': ' conceptions'},
 {'score': 0.008094606921076775,
  'sequence': 'What is wealth of nations conception.',
  'token': 418,
  'token_str': ' conception'},
 {'score': 0.007942058145999908,
  'sequence': 'What is wealth of nations experience.',
  'token': 531,
  'token_str': ' experience'},
 {'score': 0.00781000591814518,
  'sequence': 'What is wealth of nations unity.',
  'token': 688,
  'token_str': ' unity'}]

The predictions might vary at each run and each time Hugging Face updates its
models.

However, the following output comes out often:

```
Human thinking involves human reason
```

**The goal here was to see how to train a transformer model. We can see that very interesting human-like predictions can be made.**

These results are experimental and subject to variations during the training process. They will change each time we train the model again.

The model would require much more data from other Age of Enlightenment thinkers.

**However, the goal of this model is to show that we can create datasets to train a transformer for a specific type of complex language modeling task.**

Thanks to the Transformer, we are only at the beginning of a new era of AI!