<a href="https://colab.research.google.com/github/harshildarji/Machine-Learning/blob/master/Transformers/EsperBERTo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### This notebook follows the [Hugging Face](https://huggingface.co/blog/how-to-train) 🤗 tutorial on training a language model from scratch.

#### Get dataset and required libraries

In [1]:
!wget -c https://cdn-datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt

--2022-02-06 13:06:10--  https://cdn-datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt
Resolving cdn-datasets.huggingface.co (cdn-datasets.huggingface.co)... 54.192.20.121, 54.192.20.22, 54.192.20.117, ...
Connecting to cdn-datasets.huggingface.co (cdn-datasets.huggingface.co)|54.192.20.121|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 312733741 (298M) [text/plain]
Saving to: ‘oscar.eo.txt’


2022-02-06 13:06:29 (17.2 MB/s) - ‘oscar.eo.txt’ saved [312733741/312733741]



In [2]:
!pip uninstall -y tensorflow
!pip install git+https://github.com/huggingface/transformers
!pip list | grep -E 'transformers|tokenizers'

Found existing installation: tensorflow 2.7.0
Uninstalling tensorflow-2.7.0:
  Successfully uninstalled tensorflow-2.7.0
Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-qyyui4gf
  Running command git clone -q https://github.com/huggingface/transformers /tmp/pip-req-build-qyyui4gf
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
     |████████████████████████████████| 596 kB 4.1 MB/s 
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
     |████████████████████████████████| 67 kB 7.4 MB/s 
Collecting tokenizers!=0.11.3,>=0.10.1
  Downloading tokenizers-0.11.4-cp37-cp37m-manylinux

#### Train a tokenizer

In [3]:
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

In [4]:
path = [str(x) for x in Path('.').glob('**/*.txt')]

In [5]:
%%time
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=path, vocab_size=52_000, min_frequency=2, special_tokens=[
    '<s>',
    '<pad>',
    '</s>',
    '<unk>',
    '<mask>',
])

CPU times: user 6min 2s, sys: 4.78 s, total: 6min 7s
Wall time: 1min 46s


Save `tokenizer`

In [6]:
!mkdir -p EsperBERTo
tokenizer.save_model('EsperBERTo')

['EsperBERTo/vocab.json', 'EsperBERTo/merges.txt']

In [7]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

In [8]:
tokenizer = ByteLevelBPETokenizer('./EsperBERTo/vocab.json', './EsperBERTo/merges.txt')

In [9]:
tokenizer._tokenizer.post_processor = BertProcessing(
    ('</s>', tokenizer.token_to_id('</s>')), 
    ('<s>', tokenizer.token_to_id('<s>')),
)
tokenizer.enable_truncation(max_length=512)

In [10]:
tokenizer.encode('Mi estas Julien.')

Encoding(num_tokens=7, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [11]:
tokenizer.encode('Mi estas Julien.').tokens

['<s>', 'Mi', 'Ġestas', 'ĠJuli', 'en', '.', '</s>']
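
The `Ġ` prefix in the tokens above marks a leading space: byte-level BPE maps every byte to a printable character before merging. The sketch below rebuilds that byte-to-character table in plain Python, assuming the tokenizer uses the same mapping as GPT-2, and turns the tokens back into text. The helper names `bytes_to_unicode` and `tokens_to_text` are illustrative and not part of the `tokenizers` API.

```python
def bytes_to_unicode():
    """Byte -> printable character table (assumed to match GPT-2's byte-level BPE)."""
    # Bytes that are already printable keep their own code point.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in keep}
    # Remaining bytes (space, control characters, ...) are shifted past U+00FF.
    shift = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + shift)
            shift += 1
    return table


def tokens_to_text(tokens):
    """Invert the mapping: characters -> bytes -> UTF-8 text."""
    char_to_byte = {c: b for b, c in bytes_to_unicode().items()}
    raw = bytes(char_to_byte[c] for c in "".join(tokens))
    return raw.decode("utf-8")


tokens = ["Mi", "Ġestas", "ĠJuli", "en", "."]  # special tokens removed
print(bytes_to_unicode()[ord(" ")])  # Ġ
print(tokens_to_text(tokens))        # Mi estas Julien.
```

Because the table covers all 256 byte values, any UTF-8 text can be tokenized without falling back to `<unk>`.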

#### Train language model

In [12]:
!nvidia-smi

Sun Feb  6 13:08:56 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [13]:
import torch
torch.cuda.is_available()

True

In [14]:
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000, 
    max_position_embeddings=514, 
    num_attention_heads=12, 
    num_hidden_layers=6, 
    type_vocab_size=1
)
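
`max_position_embeddings` is 514 rather than 512 because RoBERTa reserves the first position indices: real tokens are numbered starting from `padding_idx + 1`, and padding keeps `padding_idx` itself. The following is a minimal plain-Python sketch of that numbering, assuming `padding_idx = 1` (the `<pad>` id implied by the special-token order used when training the tokenizer). `roberta_position_ids` is a hypothetical helper, not a library function.

```python
PAD_ID = 1  # assumption: '<pad>' is the second special token, so its id is 1


def roberta_position_ids(input_ids, padding_idx=PAD_ID):
    """Number non-padding tokens from padding_idx + 1; padding keeps padding_idx."""
    positions, seen = [], 0
    for token_id in input_ids:
        if token_id == padding_idx:
            positions.append(padding_idx)
        else:
            seen += 1
            positions.append(padding_idx + seen)
    return positions


# A full-length input of 512 non-padding tokens (ids are arbitrary here).
full = [0] + [10] * 510 + [2]
ids = roberta_position_ids(full)
print(ids[0], ids[-1])  # 2 513
# Largest index 513 means the position table needs 513 + 1 = 514 rows.
print(max(ids) + 1)     # 514
```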

In [15]:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained('./EsperBERTo', model_max_length=512)

file ./EsperBERTo/config.json not found


In [16]:
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config)

In [17]:
model.num_parameters()

83504416
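
The figure above can be reproduced by hand. The sketch below adds up the weights of each component, assuming the `RobertaConfig` defaults `hidden_size=768` and `intermediate_size=3072`, an output projection tied to the word embeddings, and no pooling layer in the masked-LM model. The variable names are illustrative.

```python
vocab, hidden, inter = 52_000, 768, 3_072
max_pos, type_vocab, layers = 514, 1, 6


def linear(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift

embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm

encoder_layer = (
    4 * linear(hidden, hidden)   # query, key, value, attention output
    + layer_norm
    + linear(hidden, inter)      # feed-forward expansion
    + linear(inter, hidden)      # feed-forward projection
    + layer_norm
)

# LM head: dense + layer norm + output bias (decoder weights assumed tied).
lm_head = linear(hidden, hidden) + layer_norm + vocab

total = embeddings + layers * encoder_layer + lm_head
print(total)  # 83504416
```

Nearly half of the parameters (about 40 million) sit in the embedding table, which is a direct consequence of the 52,000-token vocabulary.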

In [18]:
%%time
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path='./oscar.eo.txt', block_size=128)



CPU times: user 5min, sys: 7.35 s, total: 5min 7s
Wall time: 1min 48s


In [19]:
from transformers import DataCollatorForLanguageModeling

data_collector = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
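
With `mlm=True`, the collator picks roughly 15% of the non-special tokens in each batch as prediction targets. Of those, about 80% become `<mask>`, 10% become a random token, and 10% stay unchanged, while every other position gets the label `-100` so the loss ignores it. A plain-Python sketch of that rule follows; the token ids and the helper name `mask_tokens` are chosen for illustration and this is not the library's implementation.

```python
import random

MASK_ID, VOCAB_SIZE = 4, 52_000
SPECIAL_IDS = {0, 1, 2, 3, 4}  # assumed ids of <s>, <pad>, </s>, <unk>, <mask>


def mask_tokens(input_ids, mlm_probability=0.15, rng=random):
    """Return (corrupted inputs, labels) following the 80/10/10 masking rule."""
    inputs, labels = list(input_ids), [-100] * len(input_ids)
    for i, token_id in enumerate(input_ids):
        if token_id in SPECIAL_IDS or rng.random() >= mlm_probability:
            continue  # not selected: label stays -100 and is ignored by the loss
        labels[i] = token_id  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_ID
        elif roll < 0.9:
            inputs[i] = rng.randrange(VOCAB_SIZE)
        # else: keep the original token in place
    return inputs, labels


rng = random.Random(0)
sentence = [0] + [rng.randrange(5, VOCAB_SIZE) for _ in range(2_000)] + [2]
inputs, labels = mask_tokens(sentence, rng=rng)
selected = sum(label != -100 for label in labels)
print(f"{selected / 2_000:.1%} of tokens selected")
```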

In [20]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./EsperBERTo/', 
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_gpu_train_batch_size=64,  # deprecated; per_device_train_batch_size is preferred
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collector,
    train_dataset=dataset
)

In [21]:
%%time
trainer.train()

Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.
***** Running training *****
  Num examples = 974545
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 15228


| Step | Training Loss |
|------|---------------|
| 500  | 7.8627 |
| 1000 | 7.2611 |
| 1500 | 7.0761 |
| 2000 | 6.9717 |
| 2500 | 6.8737 |
| 3000 | 6.7918 |
| 3500 | 6.7569 |
| 4000 | 6.7079 |
| 4500 | 6.6511 |
| 5000 | 6.5827 |


Saving model checkpoint to ./EsperBERTo/checkpoint-10000
Configuration saved in ./EsperBERTo/checkpoint-10000/config.json
Model weights saved in ./EsperBERTo/checkpoint-10000/pytorch_model.bin





Training completed. Do not forget to share your model on huggingface.co/models =)




CPU times: user 2h 48min 48s, sys: 3min 2s, total: 2h 51min 50s
Wall time: 2h 51min 6s


TrainOutput(global_step=15228, training_loss=6.0911533925863335, metrics={'train_runtime': 10266.2503, 'train_samples_per_second': 94.927, 'train_steps_per_second': 1.483, 'total_flos': 3.231256266892493e+16, 'train_loss': 6.0911533925863335, 'epoch': 1.0})
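
The numbers in the log are consistent with each other: one epoch over 974,545 examples at a total batch size of 64 takes `ceil(974545 / 64)` optimizer steps, and dividing by the reported runtime gives the throughput figures. A quick check in plain Python, using only values copied from the output above:

```python
import math

num_examples = 974_545
total_batch_size = 64
train_runtime = 10_266.2503  # seconds, from TrainOutput

steps = math.ceil(num_examples / total_batch_size)
print(steps)                                   # 15228
print(round(num_examples / train_runtime, 1))  # 94.9 samples per second
print(round(steps / train_runtime, 3))         # 1.483 steps per second
print(round(train_runtime / 3600, 2))          # 2.85 hours
```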

Save model

In [22]:
trainer.save_model('./EsperBERTo/')

Saving model checkpoint to ./EsperBERTo/
Configuration saved in ./EsperBERTo/config.json
Model weights saved in ./EsperBERTo/pytorch_model.bin


#### Test language model

In [23]:
from transformers import pipeline

In [24]:
fill_mask = pipeline(
    'fill-mask',
    model='./EsperBERTo/',
    tokenizer='./EsperBERTo/'
)

loading configuration file ./EsperBERTo/config.json
Model config RobertaConfig {
  "_name_or_path": "./EsperBERTo/",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.17.0.dev0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}

In [25]:
fill_mask('La suno <mask>.')

[{'score': 0.0064928182400763035,
  'sequence': 'La suno estas.',
  'token': 316,
  'token_str': ' estas'},
 {'score': 0.005978148430585861,
  'sequence': 'La suno okuloj.',
  'token': 2574,
  'token_str': ' okuloj'},
 {'score': 0.003927275072783232,
  'sequence': 'La suno kapon.',
  'token': 4094,
  'token_str': ' kapon'},
 {'score': 0.0035268838983029127,
  'sequence': 'La suno vizaĝo.',
  'token': 4051,
  'token_str': ' vizaĝo'},
 {'score': 0.0031614797189831734,
  'sequence': 'La suno suno.',
  'token': 3938,
  'token_str': ' suno'}]

In [26]:
fill_mask("Jen la komenco de bela <mask>.")

[{'score': 0.010006673634052277,
  'sequence': 'Jen la komenco de bela tago.',
  'token': 1633,
  'token_str': ' tago'},
 {'score': 0.008677692152559757,
  'sequence': 'Jen la komenco de bela tempo.',
  'token': 1021,
  'token_str': ' tempo'},
 {'score': 0.008421780541539192,
  'sequence': 'Jen la komenco de bela mondo.',
  'token': 945,
  'token_str': ' mondo'},
 {'score': 0.006265346426516771,
  'sequence': 'Jen la komenco de bela jaroj.',
  'token': 757,
  'token_str': ' jaroj'},
 {'score': 0.005135939922183752,
  'sequence': 'Jen la komenco de bela vivo.',
  'token': 1160,
  'token_str': ' vivo'}]
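
The scores above are low in absolute terms: after one epoch, the five best candidates together account for only a few percent of the probability mass. Below is a small sketch that summarizes a `fill_mask` result, with the scores copied from the output above and rounded to four decimals. `summarize` is a hypothetical helper.

```python
# Results for "Jen la komenco de bela <mask>.", scores rounded to 4 decimals.
results = [
    {"token_str": " tago", "score": 0.0100},
    {"token_str": " tempo", "score": 0.0087},
    {"token_str": " mondo", "score": 0.0084},
    {"token_str": " jaroj", "score": 0.0063},
    {"token_str": " vivo", "score": 0.0051},
]


def summarize(predictions):
    """Return the best candidate and the total probability of the listed ones."""
    best = max(predictions, key=lambda p: p["score"])
    mass = sum(p["score"] for p in predictions)
    return best["token_str"].strip(), mass


best, mass = summarize(results)
print(best)           # tago
print(f"{mass:.2%}")  # 3.85%
```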