# BERT Pre-Training from Scratch

## Dataset 

You can log in to Hugging Face to interact with its Hub

- There, you can also create repositories, upload datasets, upload models and more

- For now, I will save everything locally

In [38]:
# check enabled GPU

import torch

# torch.zeros(1).cuda()

torch.cuda.is_available()

True

Using the English Wikipedia dataset (2023 dump)

- each document (one Wikipedia page) is stored in one row

- the Wikipedia dataset has several columns, including the title. I only used the main text of the page

- I also tried adding the BookCorpus dataset, but it became too much for the training, so this is something to consider later

- Luckily, a dataset from Hugging Face only needs to be downloaded once; it is saved in the cache for future loads!

In [39]:
from datasets import load_dataset, concatenate_datasets
from tqdm import tqdm

wikipedia = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")
# bookcorpus = load_dataset("bookcorpus", split="train")

wikipedia = wikipedia.remove_columns([col for col in wikipedia.column_names if col != "text"])  # only keep the 'text' column

#assert bookcorpus.features.type == wikipedia.features.type

#raw_datasets = concatenate_datasets([bookcorpus, wikipedia])

# def remove_non_ascii(example):
#     example["text"] = example["text"].encode("ascii", errors="ignore").decode()
#     return example  # map() needs the updated example returned

# raw_datasets = wikipedia.map(remove_non_ascii)

raw_datasets = wikipedia

The Wikipedia dataset has about 6.4 million documents

In [40]:
raw_datasets

Dataset({
    features: ['text'],
    num_rows: 6407814
})

## Training a tokenizer 

Train a tokenizer, starting from the pre-trained BERT tokenizer configuration on Hugging Face

- Training a tokenizer is important because the tokenizer is responsible for turning the input text into a representation the model can work with, based on the vocabulary it learned!

- each word in the input is represented by one token (or a sequence of tokens), and each token is mapped to an embedding inside the model

- the pre-loaded tokenizer is uncased, meaning that all uppercase letters are converted to lowercase, which reduces complexity and vocabulary size

It will contain the following special tokens:

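    [UNK]: Unknown (used for text that cannot be mapped to the vocabulary)
    [SEP]: Separator (ends a sequence and separates sequence pairs)
    [PAD]: Padding (fills sequences up to a common length)
    [CLS]: Classification (first token of every sequence, used for classification)
    [MASK]: Masking (stands in for a token hidden during training)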


Train using a batch iterator that yields the dataset in batches of 10_000 documents.

And load the pre-configured **bert-base-uncased** tokenizer as the starting point

In [41]:
from transformers import BertTokenizerFast

# create a python generator to dynamically load the data, one batch at a time
def batch_iterator(batch_size=10_000):
    for i in tqdm(range(0, len(raw_datasets), batch_size)):
        yield raw_datasets[i : i + batch_size]["text"]

# load an existing tokenizer to reuse its special tokens

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

The pre-loaded tokenizer already has a vocabulary of 30522 unique tokens

I will increase the vocabulary size to 35_000 so it can learn some more tokens from the input

In [42]:
vocabulary_size = 35_000

Train the tokenizer: *(skipped, because this block of code was executed only once)*

Save the tokenizer: *(skipped, because this block of code was executed only once)*

Load locally

In [43]:
tokenizer = BertTokenizerFast.from_pretrained("tokenizers/35_000", local_files_only=True)

tokenizer

BertTokenizerFast(name_or_path='tokenizers/35_000', vocab_size=35000, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	4: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

Test the tokenizer with a sample text

As you can see, the input string is split into several tokens (all stored in the vocabulary), and each token has a unique ID

Open ***/tokenizers/35_000/vocab.txt*** to see more

In [44]:
sample = '''
Can you can a can as a canner can can a can?
'''

encoding = tokenizer.encode(sample)

print(encoding)

print(tokenizer.convert_ids_to_tokens(encoding))

[2, 25281, 26019, 25281, 43, 25281, 25041, 43, 28445, 24988, 25281, 25281, 43, 25281, 35, 3]
['[CLS]', 'can', 'you', 'can', 'a', 'can', 'as', 'a', 'cann', '##er', 'can', 'can', 'a', 'can', '?', '[SEP]']
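The split of "canner" into `cann` + `##er` is what a greedy longest-match-first lookup produces. Below is a minimal pure-Python sketch of that lookup, assuming a tiny made-up vocabulary (`TOY_VOCAB` is hypothetical, not the trained 35_000-token one) and skipping the punctuation handling and normalization the real tokenizer does.

```python
# Toy vocabulary, made up for illustration; the real one lives in vocab.txt.
TOY_VOCAB = {"[UNK]", "can", "cann", "you", "a", "as", "##er", "##ner", "?"}


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab):
    """Lowercase, split on whitespace, then apply the lookup to each word."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    return tokens


print(tokenize("Can you can a canner", TOY_VOCAB))
# ['can', 'you', 'can', 'a', 'cann', '##er']
```

Because the longest prefix wins, "canner" becomes `cann` + `##er` even though `can` + `##ner` would also be possible with this toy vocabulary.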


The bert-base-uncased tokenizer is configured with a `model_max_length` of 512, meaning that an input sequence given to the BERT model can have up to 512 tokens (the context size).

However, on a GTX 1070, the memory of the video card was not enough for this context size.

Therefore, I will reduce it to 128 and truncate each document in the dataset to at most 128 tokens

In [45]:
# hard limit the size of context
tokenizer.model_max_length = 128
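To make the effect of the 128-token limit concrete, here is a small sketch of truncation on plain lists of token IDs. It assumes the limit includes the `[CLS]` and `[SEP]` tokens, so only the first 126 tokens of a page survive; the helper name `truncate_with_specials` and the sample IDs are made up for illustration.

```python
CLS_ID, SEP_ID = 2, 3  # ids of [CLS] and [SEP] in the tokenizer printed above
MAX_LEN = 128


def truncate_with_specials(token_ids, max_len=MAX_LEN):
    """Keep the first tokens so that [CLS] + tokens + [SEP] fits in max_len."""
    body = token_ids[: max_len - 2]  # reserve two slots for the special tokens
    return [CLS_ID] + body + [SEP_ID]


long_doc = list(range(100, 400))   # 300 made-up token ids
short_doc = list(range(100, 110))  # 10 made-up token ids

print(len(truncate_with_specials(long_doc)))   # 128
print(len(truncate_with_specials(short_doc)))  # 12
```

Everything after the cut is simply dropped, so with this setup the model only ever sees the opening of each Wikipedia page.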

*(skipped, because this block of code was executed only once)*

(The prompt above is bugged, it should have appeared below) *(skipped, because this block of code was executed only once)*

Save the tokenized dataset locally. *(skipped, because this block of code was executed only once)*

From then on, always load the tokenized dataset from disk for the training

In [46]:
from datasets import load_from_disk

# load tokenized dataset locally
tokenized_datasets = load_from_disk(f"dataset/tokenized-train/{tokenizer.model_max_length}")

print(tokenized_datasets.features)

tokenized_datasets

{'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'special_tokens_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}


Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'special_tokens_mask'],
    num_rows: 6407814
})

In [47]:
print(tokenizer.convert_ids_to_tokens( tokenized_datasets[0]["input_ids"] )) 

['[CLS]', 'ana', '##r', '##chi', '##s', '##m', 'is', 'a', 'political', 'philosophy', 'and', 'movement', 'that', 'is', 'sk', '##ep', '##tic', '##al', 'of', 'all', 'just', '##ifications', 'for', 'authority', 'and', 'seek', '##s', 'to', 'ab', '##olis', '##h', 'the', 'institutions', 'it', 'claims', 'maintain', 'un', '##ne', '##cess', '##ary', 'co', '##erc', '##ion', 'and', 'hier', '##arch', '##y', ',', 'typically', 'including', 'nation', '-', 'states', ',', 'and', 'capital', '##ism', '.', 'ana', '##r', '##chi', '##s', '##m', 'advocate', '##s', 'for', 'the', 'replacement', 'of', 'the', 'state', 'with', 'state', '##less', 'societies', 'and', 'vol', '##unt', '##ary', 'free', 'associations', '.', 'as', 'a', 'historically', 'left', '-', 'wing', 'movement', ',', 'this', 'reading', 'of', 'ana', '##r', '##chi', '##s', '##m', 'is', 'placed', 'on', 'the', 'far', '##th', '##est', 'left', 'of', 'the', 'political', 'spect', '##rum', ',', 'usually', 'described', 'as', 'the', 'libert', '##arian', 'wing',

In [48]:
from transformers import BertConfig

config = BertConfig.from_pretrained("bert-base-uncased")

config

BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.45.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

In [49]:
# shrink the model so it trains faster on my slow GPU

config.vocab_size = vocabulary_size
config.num_hidden_layers = 4
config.num_attention_heads = 8
config.intermediate_size = 1024
config.hidden_size = 256
config.max_position_embeddings = 128

config

BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 256,
  "initializer_range": 0.02,
  "intermediate_size": 1024,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 128,
  "model_type": "bert",
  "num_attention_heads": 8,
  "num_hidden_layers": 4,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.45.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 35000
}

In [50]:
from transformers import BertForMaskedLM

model = BertForMaskedLM(config=config)

print(model.num_parameters())

model

12254136


BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(35000, 256, padding_idx=0)
      (position_embeddings): Embedding(128, 256)
      (token_type_embeddings): Embedding(2, 256)
      (LayerNorm): LayerNorm((256,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-3): 4 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=256, out_features=256, bias=True)
              (key): Linear(in_features=256, out_features=256, bias=True)
              (value): Linear(in_features=256, out_features=256, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=256, out_features=256, bias=True)
              (LayerNorm): LayerNorm((256,), eps=1e-12, elementwise
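The 12254136 reported by `model.num_parameters()` can be reproduced by hand from the reduced config. The sketch below is plain arithmetic under two assumptions: the MLM output layer shares its weight matrix with the word embeddings (so it is counted once), and the model has no pooler layer.

```python
# Values copied from the reduced config above.
vocab, hidden, layers = 35_000, 256, 4
intermediate, max_pos, type_vocab = 1024, 128, 2


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift vectors

embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm

per_layer = (
    3 * linear(hidden, hidden)      # query, key, value
    + linear(hidden, hidden)        # attention output projection
    + layer_norm
    + linear(hidden, intermediate)  # feed-forward up-projection
    + linear(intermediate, hidden)  # feed-forward down-projection
    + layer_norm
)

# MLM head: dense transform + LayerNorm + output bias. The output weight
# matrix is assumed tied to the word embeddings, so it is not counted again.
mlm_head = linear(hidden, hidden) + layer_norm + vocab

total = embeddings + layers * per_layer + mlm_head
print(embeddings, per_layer, mlm_head, total)
# 8993792 789760 101304 12254136
```

Roughly three quarters of the parameters sit in the embedding table, so in a model this small the vocabulary size matters more for memory than the number of layers.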

## Masked Language Modeling (MLM) training

In [51]:
from transformers import DataCollatorForLanguageModeling

# mask 15% of the tokens
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm = True,
    mlm_probability=0.15
)
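As a rough picture of what the collator does to each sequence, here is a pure-Python sketch of the masking rule. It assumes the usual BERT recipe: 15% of the non-special tokens are selected, and of those 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged. The function `mask_tokens` and the random sample sequence are made up for illustration; the real collator works on batched tensors.

```python
import random

MASK_ID = 4           # id of [MASK] in the tokenizer printed earlier
VOCAB_SIZE = 35_000
IGNORE = -100         # label value that the loss skips


def mask_tokens(input_ids, special_mask, mlm_probability=0.15, rng=random):
    """Return (masked_inputs, labels) for one sequence."""
    inputs, labels = list(input_ids), []
    for i, (tok, is_special) in enumerate(zip(input_ids, special_mask)):
        if is_special or rng.random() >= mlm_probability:
            labels.append(IGNORE)  # not selected: no loss on this position
            continue
        labels.append(tok)  # selected: the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_ID                    # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels


rng = random.Random(0)
ids = [2] + [rng.randrange(5, VOCAB_SIZE) for _ in range(126)] + [3]
special = [1] + [0] * 126 + [1]  # plays the role of special_tokens_mask
masked, labels = mask_tokens(ids, special, rng=rng)
print(sum(label != IGNORE for label in labels), "of", len(ids), "positions selected")
```

The `special_tokens_mask` column stored in the tokenized dataset serves the same purpose as `special` here: it keeps `[CLS]` and `[SEP]` from being masked.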

In [52]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='training/model2',
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=32,
    save_steps=5_000,
    save_total_limit=2,
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets,
)
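A quick back-of-the-envelope check of how long one epoch is with these arguments, assuming a single GPU and no gradient accumulation:

```python
import math

num_rows = 6_407_814   # rows in the tokenized dataset
batch_size = 32        # per_device_train_batch_size
save_steps = 5_000

# Assumes a single GPU and no gradient accumulation.
steps_per_epoch = math.ceil(num_rows / batch_size)
checkpoints_written = steps_per_epoch // save_steps

print(steps_per_epoch)       # 200245
print(checkpoints_written)   # 40
```

So one epoch is about 200k optimizer steps, and with `save_total_limit=2` only the two most recent of the roughly 40 checkpoints stay on disk.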

Model 1 (previous run), loss values for reference:

    step 500: 10.345200

    step 85000: 6.5
    
    step 200000: 6.331900

In [53]:
trainer.train()

trainer.save_model("trained/model2")

Step,Training Loss
500,8.0596
1000,7.1348
1500,7.0017
2000,6.8927
2500,6.8298
3000,6.7642
3500,6.712
4000,6.647
4500,6.5946
5000,6.5411


In [55]:
from transformers import pipeline

model = BertForMaskedLM.from_pretrained('trained/model2/')

fill_mask = pipeline(
    "fill-mask",
    model=model,
    tokenizer=tokenizer,
    device=0
)

test1 = "I didn't undestand the [MASK], I will study harder."

fill_mask(test1)

[{'score': 0.05094342678785324,
  'token': 51,
  'token_str': 'i',
  'sequence': "i didn ' t undestand the i, i will study harder."},
 {'score': 0.039161842316389084,
  'token': 25433,
  'token_str': 'world',
  'sequence': "i didn ' t undestand the world, i will study harder."},
 {'score': 0.01930907741189003,
  'token': 25239,
  'token_str': 'year',
  'sequence': "i didn ' t undestand the year, i will study harder."},
 {'score': 0.018123740330338478,
  'token': 25851,
  'token_str': 'name',
  'sequence': "i didn ' t undestand the name, i will study harder."},
 {'score': 0.01271581370383501,
  'token': 25405,
  'token_str': 'time',
  'sequence': "i didn ' t undestand the time, i will study harder."}]

In [56]:
fill_mask("Good girls, like bad [MASK]")

[{'score': 0.6839563250541687,
  'token': 28583,
  'token_str': 'girls',
  'sequence': 'good girls, like bad girls'},
 {'score': 0.05937695503234863,
  'token': 28930,
  'token_str': 'boys',
  'sequence': 'good girls, like bad boys'},
 {'score': 0.007571056485176086,
  'token': 25701,
  'token_str': 'women',
  'sequence': 'good girls, like bad women'},
 {'score': 0.007524473592638969,
  'token': 31362,
  'token_str': 'saints',
  'sequence': 'good girls, like bad saints'},
 {'score': 0.006937911733984947,
  'token': 26545,
  'token_str': '##ness',
  'sequence': 'good girls, like badness'}]
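The `score` values in these outputs are probabilities: the model's logits at the `[MASK]` position go through a softmax over the whole vocabulary and the five largest entries are returned. A minimal sketch of that step, using made-up logits over a six-token toy vocabulary (the numbers are not taken from the model):

```python
import math


def top_k_softmax(logits, k=5):
    """Softmax over all vocabulary logits, then the k most probable entries."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


# Made-up logits for the [MASK] position over a six-token toy vocabulary.
toy_logits = {"girls": 5.0, "boys": 2.6, "women": 0.5, "saints": 0.5, "##ness": 0.4, "the": -1.0}

for token, prob in top_k_softmax(toy_logits, k=3):
    print(f"{token}: {prob:.3f}")
```

Since the probabilities are spread over 35_000 tokens, a top score of 0.05 (first test) means the model is very unsure, while 0.68 (second test) means it is fairly confident.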

Train for the NSP (next sentence prediction) task now

Never mind, I should have trained both tasks simultaneously. Be careful with forgetting when training them one after the other

In [None]:
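from transformers import BertForNextSentencePrediction, DataCollatorForLanguageModeling

model = BertForNextSentencePrediction.from_pretrained("trained/model2")

# abandoned here: this collator only builds MLM inputs; NSP also needs
# sentence pairs with a next-sentence label, which it does not create
data_collator = DataCollatorForLanguageModeling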

Some weights of BertForNextSentencePrediction were not initialized from the model checkpoint at trained/model2 and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
