This notebook covers:
- Training a tokenizer from scratch
- Saving the trained tokenizer to files
- Recreating the tokenizer for the pretraining process
- Initializing a RoBERTa model from scratch
- Exploring the configuration of the model
- Building the dataset for the trainer
- Initializing the trainer
- Pretraining the model
- Saving the model
- Applying the model to the downstream task of masked language modeling

In [None]:
# Installing Hugging Face Transformers
# We won't need TensorFlow here
!pip uninstall -y tensorflow
# Install `transformers` from master
!pip install git+https://github.com/huggingface/transformers
!pip list | grep -E 'transformers|tokenizers'

Found existing installation: tensorflow 2.5.0
Uninstalling tensorflow-2.5.0:
  Successfully uninstalled tensorflow-2.5.0
Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-pawv5hmb
  Running command git clone -q https://github.com/huggingface/transformers /tmp/pip-req-build-pawv5hmb
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting huggingface-hub==0.0.12
  Downloading huggingface_hub-0.0.12-py3-none-any.whl (37 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)

In [None]:
%%time
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer
paths = [str(x) for x in Path(".").glob("**/*.txt")]


CPU times: user 1.08 ms, sys: 0 ns, total: 1.08 ms
Wall time: 2.27 ms


In [None]:
paths

['taniya.txt']

In [None]:
# Initialize the tokenizer
tokenizer = ByteLevelBPETokenizer()
# Train it on our own corpus
tokenizer.train(
    files=paths,
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=[
        "<s>",
        "<pad>",
        "</s>",
        "<unk>",
        "<mask>",
    ],
)

In [None]:
# Save the tokenizer files (vocab.json and merges.txt)
import os
token_dir = "TaniyaBERT"
# Create the output directory if it does not exist yet
os.makedirs(token_dir, exist_ok=True)
tokenizer.save_model(token_dir)

['TaniyaBERT/vocab.json', 'TaniyaBERT/merges.txt']

In [None]:
# Loading the trained tokenizer
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing
tokenizer = ByteLevelBPETokenizer(
    "./TaniyaBERT/vocab.json",
    "./TaniyaBERT/merges.txt",
)

In [None]:
tokenizer.encode("The Critique of Pure Reason.").tokens

['The', 'ĠCritique', 'Ġof', 'ĠPure', 'ĠReason', '.']
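
The `Ġ` prefix in the tokens above is how byte-level BPE marks a leading space. A minimal sketch of turning such tokens back into text, assuming only the space mapping matters for this ASCII example (the helper name `tokens_to_text` is hypothetical, not part of the library):

```python
def tokens_to_text(tokens):
    """Join byte-level BPE tokens and restore the spaces marked with 'Ġ'."""
    return "".join(tokens).replace("Ġ", " ")


tokens = ['The', 'ĠCritique', 'Ġof', 'ĠPure', 'ĠReason', '.']
text = tokens_to_text(tokens)
print(text)
```

In practice `tokenizer.decode` handles this mapping; the sketch only makes the marker visible.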

The tokenizer output now needs post-processing to fit the BERT-style input format the model expects. The post-processor adds a start token (`<s>`) and an end token (`</s>`) to each sequence.

In [None]:
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)
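
The effect of the post-processor and the truncation limit configured above can be sketched in plain Python. This is an illustration, not the library's implementation. It assumes the special-token ids follow the order given at training time (`<s>` = 0, `</s>` = 2) and that the 512-token limit counts the two added tokens. The helper `wrap_and_truncate` is hypothetical.

```python
BOS_ID, EOS_ID = 0, 2  # assumed ids of "<s>" and "</s>"


def wrap_and_truncate(token_ids, max_length=512):
    """Leave room for the two special tokens, then wrap the sequence."""
    body = token_ids[: max_length - 2]
    return [BOS_ID] + body + [EOS_ID]


short = wrap_and_truncate([10, 11, 12])
long = wrap_and_truncate(list(range(5, 1005)))  # 1000 tokens before truncation
print(short)
print(len(long))
```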

In [None]:
# Checking whether CUDA is available
import torch 
torch.cuda.is_available()

True

In [None]:
# Defining the configuration of the model
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

In [None]:
print(config)

RobertaConfig {
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.10.0.dev0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}
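
A sketch of why `max_position_embeddings` is 514 rather than 512, assuming RoBERTa-style position ids that reserve `pad_token_id` for padding and number real tokens starting at `pad_token_id + 1`. The helper `position_ids` is hypothetical and only mirrors that numbering scheme.

```python
PAD_ID = 1  # pad_token_id from the config above


def position_ids(input_ids, pad_id=PAD_ID):
    """Number non-padding tokens from pad_id + 1; padding keeps pad_id."""
    ids, count = [], 0
    for tok in input_ids:
        if tok == pad_id:
            ids.append(pad_id)
        else:
            count += 1
            ids.append(count + pad_id)
    return ids


example = position_ids([0, 45, 67, 2, 1, 1])  # two trailing pad tokens
full = position_ids([0] + [7] * 510 + [2])    # a full 512-token sequence
rows_needed = max(full) + 1                   # embedding rows for the largest id
print(example)
print(rows_needed)
```

Under these assumptions a 512-token sequence uses position ids 2 to 513, so the position embedding table needs 514 rows.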



Save the `config` object as a JSON file inside the TaniyaBERT directory (for example with `config.save_pretrained("./TaniyaBERT")`).

In [None]:
# Recreating the tokenizer
from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained("./TaniyaBERT", max_length=512)

In [None]:
#Initializing a Model From Scratch
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config)
print(model)

RobertaForMaskedLM(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(52000, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNor

In [None]:
print(model.num_parameters())

83504416
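
The parameter count can be reproduced by hand from the configuration. The sketch below assumes the masked-LM head shares its output weight matrix with the word embeddings (so that matrix is counted once) and that the model has no pooler layer. The variable and helper names are illustrative.

```python
vocab, hidden, layers, inter, max_pos, type_vocab = 52_000, 768, 6, 3072, 514, 1


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift vectors

# Word, position and token-type embeddings, followed by one LayerNorm
embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm

per_layer = (
    3 * linear(hidden, hidden)             # query, key, value
    + linear(hidden, hidden) + layer_norm  # attention output
    + linear(hidden, inter)                # feed-forward expansion
    + linear(inter, hidden) + layer_norm   # feed-forward projection
)

# Dense layer, LayerNorm and output bias; the output weights are assumed tied
lm_head = linear(hidden, hidden) + layer_norm + vocab

total = embeddings + layers * per_layer + lm_head
print(total)  # 83504416
```

Roughly half of the parameters sit in the embedding table, which is why the vocabulary size matters so much for a 6-layer model.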


In [None]:
%%time
# Building the Dataset (the %%time cell magic must be the first line of the cell)
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./taniya.txt",
    block_size=128,
)



CPU times: user 27.3 s, sys: 438 ms, total: 27.8 s
Wall time: 27.7 s


We need to define a data collator before initializing the trainer. A data collator takes 
samples from the dataset and collates them into batches. The results are dictionary-like objects.
We prepare the batches for Masked Language Modeling (MLM) by setting mlm=True.
We also set mlm_probability=0.15, which determines the fraction of tokens 
selected for masking during the pretraining process.
We now initialize data_collator with our tokenizer, MLM activated, and the 
masking probability set to 0.15.

In [None]:
#Defining a Data Collator
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
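
What the collator does with `mlm_probability=0.15` can be sketched in plain Python. This is a simplified illustration, not the library code: it assumes `<mask>` has id 4 (the fifth special token), uses the usual 80/10/10 split between masking, random replacement, and leaving the token unchanged, and ignores the fact that the real collator never selects special tokens. The function `mask_tokens` is hypothetical.

```python
import random

MASK_ID = 4          # assumed id of "<mask>"
VOCAB_SIZE = 52_000
IGNORE = -100        # label value that the loss skips


def mask_tokens(token_ids, mlm_probability=0.15, rng=random):
    """Return (inputs, labels) with about mlm_probability of positions selected."""
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if rng.random() >= mlm_probability:
            labels.append(IGNORE)      # not selected: no loss at this position
            continue
        labels.append(tok)             # selected: the model must predict the original
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_ID        # 80%: replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels


rng = random.Random(0)
ids = [rng.randrange(5, VOCAB_SIZE) for _ in range(10_000)]
inputs, labels = mask_tokens(ids, rng=rng)
selected = sum(label != IGNORE for label in labels)
print(selected / len(ids))  # close to 0.15
```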

In [None]:
#Initializing the Trainer
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./TaniyaBERT",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

In [None]:
%%time
# Pre-training the Model (the %%time cell magic must be the first line of the cell)
trainer.train()

***** Running training *****
  Num examples = 170964
  Num Epochs = 1
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 2672


Step,Training Loss
500,6.5924
1000,5.7354
1500,5.2655
2000,5.0082
2500,4.8569




Training completed. Do not forget to share your model on huggingface.co/models =)




CPU times: user 10min 23s, sys: 3.16 s, total: 10min 26s
Wall time: 10min 27s


TrainOutput(global_step=2672, training_loss=5.448299407958984, metrics={'train_runtime': 627.8891, 'train_samples_per_second': 272.284, 'train_steps_per_second': 4.256, 'total_flos': 873620128952064.0, 'train_loss': 5.448299407958984, 'epoch': 1.0})
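
The step count and throughput figures in the log follow directly from the dataset size and batch size. A small check, assuming a single device and no gradient accumulation as reported in the log above:

```python
import math

num_examples = 170_964
batch_size = 64
epochs = 1
train_runtime = 627.8891  # seconds, taken from the TrainOutput above

# The last, partially filled batch still counts as one optimization step
steps = math.ceil(num_examples / batch_size) * epochs
samples_per_second = num_examples * epochs / train_runtime
steps_per_second = steps / train_runtime

print(steps, round(samples_per_second, 3), round(steps_per_second, 3))
```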

In [None]:
#Saving the Final Model(+tokenizer + config) to disk
trainer.save_model("./TaniyaBERT")

Saving model checkpoint to ./TaniyaBERT
Configuration saved in ./TaniyaBERT/config.json
Model weights saved in ./TaniyaBERT/pytorch_model.bin


## USE THE TRAINED MODEL

In [None]:
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./TaniyaBERT",
    tokenizer="./TaniyaBERT"
)

loading configuration file ./TaniyaBERT/config.json
Model config RobertaConfig {
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.10.0.dev0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}


In [None]:
fill_mask("Human thinking involves<mask>.")  # "Ġ" marks a leading space in the vocabulary; token_str shows it as ' '

[{'score': 0.014007464051246643,
  'sequence': 'Human thinking involves experience.',
  'token': 531,
  'token_str': ' experience'},
 {'score': 0.010526700876653194,
  'sequence': 'Human thinking involves reason.',
  'token': 393,
  'token_str': ' reason'},
 {'score': 0.010396560654044151,
  'sequence': 'Human thinking involves conceptions.',
  'token': 605,
  'token_str': ' conceptions'},
 {'score': 0.008035563863813877,
  'sequence': 'Human thinking involves I.',
  'token': 364,
  'token_str': ' I'},
 {'score': 0.007603162433952093,
  'sequence': 'Human thinking involves it.',
  'token': 306,
  'token_str': ' it'}]