# Robustly Optimized BERT Pretraining Approach (RoBERTa)

## Improvements over BERT

### Byte-Pair Encoding (BPE)
- Subword tokenization scheme used in place of BERT's WordPiece (text is mapped to token IDs, which the model then embeds)
- Break down words into subcomponents:
    - Example: "I like smaller cats" could become ["I", "like", "small", "er", "cats"]
- Pros:
    - Allows the model to handle words it hasn't seen before by composing them from known subword units
    - Target vocabulary can be smaller
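
To make the subword idea concrete, here is a minimal, character-level sketch of how BPE learns merge rules and then uses them to split a word. The function names (`merge_pair`, `learn_bpe_merges`, `segment`) and the toy word list are hypothetical, chosen only for illustration. Production tokenizers work on bytes and add many optimizations.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a tuple of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Each word starts out as a tuple of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the most frequent pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def segment(word, merges):
    """Split a (possibly unseen) word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy training data (hypothetical); "lowest" never appears in it.
merges = learn_bpe_merges(["low", "low", "lower"], num_merges=10)
print(merges)
print(segment("lowest", merges))
```

An unseen word such as "lowest" still gets a segmentation, because any piece that no merge covers falls back to smaller units.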

### DistilBERT
- Smaller model than RoBERTa that trains and runs much faster
- Results are slightly less accurate than RoBERTa

In [None]:
#@title I. Load Dataset and Setup GPU
!curl -L https://raw.githubusercontent.com/Denis2054/Transformers-for-NLP-2nd-Edition/master/Chapter04/kant.txt --output "kant.txt"

# Check that an NVIDIA GPU is visible to PyTorch
import torch
!nvidia-smi
torch.cuda.is_available()

In [None]:
#@title II. Training a Byte-Level Tokenizer

#@markdown #### Benefits of Byte-Level Tokenization
#@markdown ##### - Allows for smaller target vocabulary
#@markdown ##### - Handles out-of-vocabulary (OOV) words by splitting them into subword units
#@markdown #### Hugging Face
#@markdown ##### - Tokenization is saved as two files
#@markdown ######--> 'merges.txt': The learned merge rules, in the order they are applied
#@markdown ######--> 'vocab.json': Mapping from each token to its index
import os
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

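# Collect every .txt file under the current directory as training data.
# Note: on a re-run this glob also matches KantaiBERT/merges.txt.
paths = [str(x) for x in Path('.').glob('**/*.txt')]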

# Train the tokenizer
tokenizer.train(files=paths, vocab_size=52000, min_frequency=2, special_tokens=[
    '<s>',
    '<pad>',
    '</s>',
    '<unk>',
    '<mask>',
])

# Create the output directory and save vocab.json and merges.txt into it
token_dir = './KantaiBERT'
os.makedirs(token_dir, exist_ok=True)
tokenizer.save_model(token_dir)

# In byte-level BPE, the character 'Ġ' stands for a space,
# so a token such as 'Ġcats' is the word "cats" preceded by a space.
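
The 'Ġ' marker comes from the byte-to-character table that GPT-2-style byte-level BPE uses to make every byte printable. The sketch below rebuilds that table in plain Python. It assumes the `tokenizers` library follows the same GPT-2 mapping, and `to_visible` is a hypothetical helper name.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Remaining bytes (control characters, space, etc.) are shifted past U+00FF.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


byte_encoder = bytes_to_unicode()


def to_visible(text):
    """Show text the way a byte-level vocabulary stores it (hypothetical helper)."""
    return "".join(byte_encoder[b] for b in text.encode("utf-8"))


print(to_visible("I like cats"))
```

The space byte (32) is the 33rd non-printable byte, so it lands on code point 256 + 32 = 288, which is 'Ġ'.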

In [None]:
#@title III. Load the Tokenizer

from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

tokenizer = ByteLevelBPETokenizer(
    './KantaiBERT/vocab.json',
    './KantaiBERT/merges.txt',
)

# Wrap every encoded sequence as <s> ... </s>.
# BertProcessing takes the end (sep) token first, then the start (cls) token.
tokenizer._tokenizer.post_processor = BertProcessing(
    ('</s>', tokenizer.token_to_id('</s>')),
    ('<s>', tokenizer.token_to_id('<s>')),
)
tokenizer.enable_truncation(max_length=512)
tokenizer.encode("I like many animals, especially cats.").tokens
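
As a rough mental model of what the post-processor and truncation do together, the following pure-Python sketch wraps a token list in start and end markers while respecting a length limit. `truncate_and_wrap` is a hypothetical helper, not part of `tokenizers`. It assumes the limit counts the two special tokens, so only `max_length - 2` content tokens are kept.

```python
def truncate_and_wrap(tokens, max_length=512, bos="<s>", eos="</s>"):
    """Cut the content so that content plus two special tokens fits max_length."""
    # Assumption: the length limit includes the start and end tokens.
    budget = max(max_length - 2, 0)
    return [bos] + list(tokens[:budget]) + [eos]


print(truncate_and_wrap(["I", "Ġlike", "Ġcats"]))
```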

In [None]:
#@title Define RoBERTa Tokenizer and Model

from transformers import RobertaTokenizer
from transformers import RobertaConfig
from transformers import RobertaForMaskedLM

tokenizer = RobertaTokenizer.from_pretrained('./KantaiBERT', model_max_length=512)

config = RobertaConfig(
    vocab_size = 52000,
    max_position_embeddings=514,
    num_attention_heads = 12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config=config)

# Print summary of model
# print(model)

# Print model parameters
# print(model.num_parameters())
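
To see where the model's size comes from, the parameter count can be estimated by hand from the config above. This is a sketch under stated assumptions: hidden size 768 and feed-forward size 3072 (the `RobertaConfig` defaults), output weights tied to the input embeddings, and an optional pooler. The function name is hypothetical, and `model.num_parameters()` may differ depending on the `transformers` version, for example on whether a pooler is included.

```python
def roberta_mlm_param_count(vocab_size=52_000, max_positions=514, type_vocab=1,
                            hidden=768, intermediate=3072, layers=6,
                            include_pooler=False):
    """Estimate trainable parameters of a RoBERTa masked-LM (hypothetical helper)."""

    def linear(n_in, n_out):
        return n_in * n_out + n_out  # weight matrix plus bias

    layer_norm = 2 * hidden  # scale and shift vectors

    # Token, position and token-type embeddings, followed by a LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, followed by a LayerNorm.
    attention = 4 * linear(hidden, hidden) + layer_norm
    # Two-layer feed-forward block, followed by a LayerNorm.
    feed_forward = (linear(hidden, intermediate)
                    + linear(intermediate, hidden) + layer_norm)
    encoder = layers * (attention + feed_forward)
    # LM head: dense layer, LayerNorm, and an output bias.
    # Assumption: the output projection reuses the embedding matrix (tied weights).
    lm_head = linear(hidden, hidden) + layer_norm + vocab_size
    pooler = linear(hidden, hidden) if include_pooler else 0
    return embeddings + encoder + lm_head + pooler


print(f"{roberta_mlm_param_count():,}")
print(f"{roberta_mlm_param_count(include_pooler=True):,}")
```

Note that `num_attention_heads` does not appear in the estimate: the heads split the same projection matrices, so changing their number leaves the parameter count unchanged.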

In [None]:
#@title Build Dataset and Collator

from transformers import LineByLineTextDataset
from transformers import DataCollatorForLanguageModeling

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path='./kant.txt',
    block_size=128, # Maximum number of tokens per example (not the batch size)
)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
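
The collator's masking step can be pictured with the short sketch below. It follows the usual BERT recipe: of the selected positions, 80% become the mask token, 10% become a random token, and 10% stay unchanged. `mask_tokens` and its arguments are hypothetical names, and the example IDs assume that special tokens received IDs 0 to 4 in the order they were listed during tokenizer training. The real collator operates on tensors.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(),
                mlm_probability=0.15, rng=None):
    """Return (inputs, labels) for masked language modeling (toy version)."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    # -100 marks positions that the loss should ignore.
    labels = [-100] * len(inputs)
    for i, tok in enumerate(token_ids):
        # Never mask special tokens; select other positions with mlm_probability.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels


# Hypothetical token IDs: 0 = <s>, 2 = </s>, 4 = <mask>.
ids = [0] + list(range(100, 120)) + [2]
inputs, labels = mask_tokens(ids, mask_id=4, vocab_size=52_000,
                             special_ids={0, 1, 2, 3, 4}, rng=random.Random(0))
print(inputs)
print(labels)
```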

In [None]:
!pip install accelerate

In [None]:
#@title Initialize the Trainer

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir = './KantaiBERT',
    overwrite_output_dir = True,
    num_train_epochs = 1,
    per_device_train_batch_size = 64,
    save_steps = 10000,
    save_total_limit = 2,
)
trainer = Trainer(
    model = model,
    args = training_args,
    data_collator = data_collator,
    train_dataset = dataset,
)

In [None]:
#@title Pre-Train the Model
# %time is the line-magic form; the %%time cell magic only works on the first line of a cell.
%time trainer.train()

In [None]:
#@title Save Model (To Dir with Tokenizer and Config)
trainer.save_model('./KantaiBERT')

In [None]:
#@title Perform Fill-Mask
from transformers import pipeline
fill_mask = pipeline(
    'fill-mask',
    model = './KantaiBERT',
    tokenizer='./KantaiBERT',
)

In [None]:
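import random

sentence = 'The reason for human existence has less to do with'

# Extend the sentence one predicted token at a time.
# Stop after a fixed number of steps; 20 is an arbitrary choice.
for _ in range(20):
  masked = f'{sentence} <mask>'
  values = fill_mask(masked)  # list of candidate fills, each a dict with 'score' and 'token_str'
  # Pick one of the returned candidates at random
  val_idx = random.randint(0, len(values) - 1)
  last_value = values[val_idx]
  print(last_value)
  sentence += last_value['token_str']
  print(sentence)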