# **How to pretrain a language model (like BERT) with your own corpus from scratch**

In 2017, Google published a paper titled '[Attention is all you need](https://arxiv.org/abs/1706.03762)', which proposed a new type of encoder-decoder architecture, the Transformer. It outperformed earlier sequence-to-sequence models on machine translation and has since become the basis for models used in natural language processing (NLP) tasks such as question answering and named entity recognition (NER).

In 2018, Google introduced BERT, **Bidirectional Encoder Representations from Transformers** ([paper](https://arxiv.org/abs/1810.04805)), which is designed to pre-train a language model on a vast corpus of raw text. What distinguishes it from earlier models like **Word2vec** and **ELMo** is that it is a **truly bidirectional model**, meaning it is trained on unlabeled text by jointly conditioning on **both left and right context simultaneously**.

(Transformers? BERT? What is up with the architecture names, Google? What's next? Avengers?)

Here are some links to help you get a clear picture of transformers, if you don't have one already:
* The paper [Attention is all you need](https://arxiv.org/abs/1706.03762)
* Jay Alammar's post [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
* [Transformer explained](https://youtu.be/z1xs9jdZnuY?list=WL) by Minsuk Heo
* [BERT research series](https://www.youtube.com/playlist?list=PLam9sigHPGwOBuH4_4fr-XvDbe5uneaf6) by Chris McCormick


In this notebook, we will be using the [tokenizers](https://github.com/huggingface/tokenizers) and [transformers](https://huggingface.co/transformers/) libraries.

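Prerequisite: `torch`. The code below uses the PyTorch classes of `transformers`, so `tensorflow` is only needed if you switch to the TensorFlow model classes.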

In [None]:
!pip install torch
!pip install tokenizers
!pip install transformers

# **Dataset**
You can use your own text corpus or download one from [OSCAR](https://oscar-corpus.com/), a collection of huge multilingual corpora obtained by language classification and filtering of Common Crawl dumps of the Web.

One thing to keep in mind: you will generally get better results by pretraining your model on more data.

If you are using your own corpus, make sure that it has **one sentence per line**, like this:

```
Mr. Cassius crossed the highway, and stopped suddenly.
Something glittered in the nearest red pool before him.
Gold, surely!
But, wonderful to relate, not an irregular, shapeless fragment of crude ore, fresh from Nature's crucible.
Looking at it more attentively, he saw that it bore the inscription, "May to Cass."
Like most of his fellow gold-seekers, Cass was superstitious.
```
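If your raw text is not split into sentences yet, something like the following rough sketch can get you started. It is only an illustration: the function name and the tiny abbreviation list are made up for this example, and it assumes a sentence ends at `.`, `!`, `?` or the danda (`।`, used in Bangla) followed by whitespace. A real corpus will need a more careful splitter.

```python
import re

# Hypothetical, tiny abbreviation list; extend it for your own corpus.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "St."}

def split_sentences(text):
    """Naively split text after '.', '!', '?' or the danda when whitespace follows."""
    pieces = re.split(r"(?<=[.!?\u0964])\s+", text.strip())
    sentences = []
    for piece in pieces:
        if not piece:
            continue
        # Re-join a piece when the previous one ended with a known abbreviation.
        if sentences and sentences[-1].split()[-1] in ABBREVIATIONS:
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return sentences

raw = ("Mr. Cassius crossed the highway, and stopped suddenly. "
       "Gold, surely! Cass was superstitious.")

# Print one sentence per line, the layout expected for the corpus file.
for sentence in split_sentences(raw):
    print(sentence)
```

Writing the returned sentences to a file, one per line, gives the layout shown above.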

I am using a text corpus (almost 500MB, containing around 2.2M sentences and 30M words) of random newspaper articles in Bangla, a low-resource language, because there are already several pretrained English BERT language models released by Google ([BERT github repo](https://github.com/google-research/bert)), and these models provide state-of-the-art performance on various NLP tasks.

But you can follow this notebook for pretraining a model in any language of your choice.

# **Tokenization**

We have to train our own tokenizer and build a vocabulary for our corpus.
We will use `BertWordPieceTokenizer` from the `tokenizers` library, with an arbitrarily chosen `vocab_size` of 50,000. The trained vocabulary will be saved to the output directory as a `'name-vocab.txt'` file.

I already had a pretrained tokenizer for Bangla, so I load that one after the training cell.
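To make the role of the vocabulary concrete, here is a minimal sketch of the greedy longest-match-first lookup that BERT-style WordPiece tokenizers apply to each word once a vocabulary exists. The toy vocabulary and the function name are made up for illustration; the real tokenizer also normalizes the text and splits it on whitespace and punctuation first.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, scanning left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matched at this position: the whole word becomes unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary, not taken from the notebook's vocab file.
toy_vocab = {"un", "##aff", "##able", "play", "##ing", "##s"}

print(wordpiece_tokenize("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("playings", toy_vocab))   # ['play', '##ing', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))        # ['[UNK]']
```

A larger vocabulary means more words are covered by a single piece, at the cost of a larger embedding matrix in the model.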

In [None]:
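# Train a tokenizer
import tokenizers

# Start from an empty vocabulary; it is learned from the corpus below
bwpt = tokenizers.BertWordPieceTokenizer()

filepath = "input file directory"  # path to your one-sentence-per-line corpus file

bwpt.train(
    files=[filepath],
    vocab_size=50000,     # target vocabulary size
    min_frequency=3,      # minimum frequency a pair needs in order to be merged
    limit_alphabet=1000   # maximum number of distinct characters kept in the alphabet
)

# Writes 'name-vocab.txt' into the directory.
# tokenizers >= 0.8 uses save_model(); older releases used bwpt.save(directory, name).
bwpt.save_model('/kaggle/working/', 'name')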

In [1]:
# Load the tokenizer
from transformers import BertTokenizer, LineByLineTextDataset

vocab_file_dir = '/kaggle/input/bert-bangla/bangla-vocab.txt' 

tokenizer = BertTokenizer.from_pretrained(vocab_file_dir)

sentence = 'শেষ দিকে সেনাবাহিনীর সদস্যরা এসব ঘর তাঁর প্রশাসনের কাছে হস্তান্তর করেন'

encoded_input = tokenizer.tokenize(sentence)
print(encoded_input)
# print(encoded_input['input_ids'])



['শেষ', 'দিকে', 'সেনাবাহিনীর', 'সদসযরা', 'এসব', 'ঘর', 'তার', 'পরশাসনের', 'কাছে', 'হসতানতর', 'করেন']
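Notice that the printed tokens have lost some Bangla signs: the input word 'তাঁর', for example, comes out as 'তার'. A likely cause (my assumption, based on how BERT's basic tokenizer behaves when lowercasing is enabled) is accent stripping: the text is NFD-normalized and every character in the Unicode category `Mn` (non-spacing mark) is dropped. The sketch below reproduces only that step with the standard library; the function name is made up for this example.

```python
import unicodedata

def strip_accents(text):
    """Drop non-spacing marks (category 'Mn') after NFD normalization."""
    text = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# 'তাঁর' written with escapes: TA, vowel sign AA, candrabindu, RA
word = "\u09a4\u09be\u0981\u09b0"
stripped = strip_accents(word)

print(word, "->", stripped)
# List the marks that were removed from the word.
print([unicodedata.name(ch) for ch in word if unicodedata.category(ch) == "Mn"])
```

If this matters for your language, check the tokenizer options in the documentation of your installed versions (for example `do_lower_case`, and `strip_accents` where available) before training the vocabulary and the model.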


In [2]:
%%time

'''
transformers has a predefined class LineByLineTextDataset()
which reads your text line by line and converts them to tokens
'''

dataset= LineByLineTextDataset(
    tokenizer = tokenizer,
    file_path = '/kaggle/input/bert-bangla/raw_bangla_for_BERT.txt',
    block_size = 128  # maximum sequence length
)

print('No. of lines: ', len(dataset))  # number of lines in your dataset

No. of lines:  2172033
CPU times: user 20min 12s, sys: 4.96 s, total: 20min 17s
Wall time: 20min 26s


# **Defining model**
Now that we have the training data ready to be fed into the model, let's define the model.
First we have to define the configuration of the BERT model.
`vocab_size` should be the size of your trained vocabulary. Keep the rest of the arguments as they are. I am assuming you know the transformer architecture well enough to understand these parameters.

We will use `BertForMaskedLM` from the transformers library, which trains with the masked language modelling (MLM) objective only, without the next sentence prediction (NSP) task.

You also need to define a `DataCollator`. What is a `DataCollator`, you ask?

A `DataCollator` is a function that takes a list of samples from a Dataset
and collates them into a batch, as a dictionary of tensors. The one used here, `DataCollatorForLanguageModeling`:

* collates batches of tensors, honoring the tokenizer's `pad_token`
* preprocesses batches for masked language modeling
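The masking step can be sketched in plain Python. This is an illustration of the BERT masking recipe, not the library's code: each non-special token is selected with probability `mlm_probability`; of the selected tokens, 80% become `[MASK]`, 10% become a random token and 10% stay unchanged. Labels are `-100` (the index the loss ignores) everywhere else. The token ids below are made up.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(),
                mlm_probability=0.15, rng=None):
    """Return (inputs, labels) for masked language modeling on one sequence."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [-100] * len(inputs)  # -100 means: do not compute loss here
    for i, tok in enumerate(token_ids):
        # Never mask special tokens; otherwise select with mlm_probability.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels

# Hypothetical ids: 2 = [CLS], 3 = [SEP], 4 = [MASK]
ids = [2, 10, 11, 12, 13, 14, 3]
inputs, labels = mask_tokens(ids, mask_id=4, vocab_size=100,
                             special_ids={2, 3}, rng=random.Random(0))
print(inputs)
print(labels)
```

Because the masking is random and done per batch, the same sentence gets different masks each time it is seen.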

In [3]:
from transformers import BertConfig, BertForMaskedLM, DataCollatorForLanguageModeling

config = BertConfig(
    vocab_size=50000,
    hidden_size=768, 
    num_hidden_layers=6, 
    num_attention_heads=12,
    max_position_embeddings=512
)
 
model = BertForMaskedLM(config)
print('No of parameters: ', model.num_parameters())


data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

No of parameters:  82556240
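If you want to see where that number comes from, the arithmetic below reproduces it from the configuration. It rests on several assumptions of mine: the `BertConfig` defaults `intermediate_size=3072` and `type_vocab_size=2`, an output projection tied to the word embeddings (so only its bias is counted), and a pooler layer inside the model. The pooler assumption is inferred from the fact that it is what reproduces the printed count; a different `transformers` version may build the model without it and report fewer parameters.

```python
def bert_mlm_param_count(vocab_size, hidden, layers, intermediate=3072,
                         max_positions=512, type_vocab=2, include_pooler=True):
    """Count trainable parameters of a BERT masked-LM under the stated assumptions."""
    def linear(n_in, n_out):
        return n_in * n_out + n_out  # weight matrix plus bias

    layer_norm = 2 * hidden  # scale and shift vectors

    # Word, position and token-type embeddings, followed by a LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, followed by a LayerNorm.
    attention = 4 * linear(hidden, hidden) + layer_norm
    # Two feed-forward projections, followed by a LayerNorm.
    feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
    encoder = layers * (attention + feed_forward)
    pooler = linear(hidden, hidden) if include_pooler else 0
    # MLM head: dense transform, LayerNorm, and a per-token output bias
    # (the output weights are shared with the word embeddings).
    mlm_head = linear(hidden, hidden) + layer_norm + vocab_size
    return embeddings + encoder + pooler + mlm_head

print(bert_mlm_param_count(50000, 768, 6))                        # 82556240
print(bert_mlm_param_count(50000, 768, 6, include_pooler=False))  # 81965648
```

The number of attention heads does not appear because it only changes how the projections are sliced, not their size. Note that the word embeddings alone account for 38.4M of the parameters, which is why `vocab_size` matters so much for a small model.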


# **Defining training arguments**
`per_device_train_batch_size` is the batch size on each device, which is not necessarily the overall batch size of the model: with more than one GPU/TPU, the effective batch size is `per_device_train_batch_size` multiplied by the number of devices.

In practice, assuming you are training on 1 GPU (in Colab or on your PC), `per_device_train_batch_size` is the batch size for your BERT model, which I have set to 32 (**the BERT paper recommends 16 or 32 for fine-tuning; its pretraining runs used a batch size of 256**).

Then instantiate a trainer with the model, data collator and dataset defined above.
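A quick way to see what the batch size means for training length is to compute the number of optimizer steps per epoch. The helper below is a sketch with a made-up name; it assumes no gradient accumulation and that the last, smaller batch is kept. With the 2,172,033 lines of this corpus and a batch size of 32 on one device, it gives the 67,877 iterations that the training output reports.

```python
import math

def steps_per_epoch(num_examples, per_device_batch_size, num_devices=1):
    """Optimizer steps in one epoch, assuming no gradient accumulation."""
    effective_batch = per_device_batch_size * num_devices
    return math.ceil(num_examples / effective_batch)

print(steps_per_epoch(2172033, 32))                 # 67877 (1 GPU)
print(steps_per_epoch(2172033, 32, num_devices=4))  # 16970 (4 GPUs, effective batch 128)
```

With `save_steps=10_000`, a run of this length writes a checkpoint roughly every 15% of the epoch.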


In [4]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='/kaggle/working/',
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=32,
    save_steps=10_000,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
    prediction_loss_only=True,  # accepted by Trainer in older transformers releases; newer ones take it via TrainingArguments
)

# **Train the model**
We are at the last step of our language model pretraining.
Call the trainer's `train()` method, then sit back and watch a movie, because this is going to take a lot of time depending on your corpus size.

Remember, **Google's BERT-base** was trained on **4 Cloud TPUs for 4 uninterrupted days**. That is **16 Cloud TPU days**!

I trained a model on a random newspaper article corpus of only 500MB, containing around 2.2M sentences and 30M words, and that took almost 4 hrs!

Don't forget to save the model! If you fall asleep (I am certain you will) and wake up to a disconnected runtime, your work is gone. RIP!

In [5]:
%%time
trainer.train()
trainer.save_model('/kaggle/working/')

[Progress bar: Epoch, max=1]

[Progress bar: Iteration, max=67877]

{"loss": 9.241739128112792, "learning_rate": 4.963168672746291e-05, "epoch": 0.007366265450741783, "step": 500}
{"loss": 8.781517308235168, "learning_rate": 4.9263373454925824e-05, "epoch": 0.014732530901483566, "step": 1000}
{"loss": 8.49215627193451, "learning_rate": 4.8895060182388734e-05, "epoch": 0.02209879635222535, "step": 1500}
{"loss": 8.30438306903839, "learning_rate": 4.8526746909851645e-05, "epoch": 0.02946506180296713, "step": 2000}
{"loss": 8.151245285987853, "learning_rate": 4.8158433637314555e-05, "epoch": 0.03683132725370891, "step": 2500}
{"loss": 8.093315546035766, "learning_rate": 4.7790120364777466e-05, "epoch": 0.0441975927044507, "step": 3000}
{"loss": 7.9972240304946896, "learning_rate": 4.7421807092240377e-05, "epoch": 0.05156385815519248, "step": 3500}
{"loss": 7.8790246143341065, "learning_rate": 4.705349381970329e-05, "epoch": 0.05893012360593426, "step": 4000}
{"loss": 7.836776434898376, "learning_rate": 4.6685180547166204e-05, "epoch": 0.06629638905667605,



{"loss": 6.917126598358155, "learning_rate": 4.226542127672113e-05, "epoch": 0.15469157446557744, "step": 10500}
{"loss": 6.840472091674805, "learning_rate": 4.1897108004184035e-05, "epoch": 0.16205783991631922, "step": 11000}
{"loss": 6.835779744148255, "learning_rate": 4.152879473164695e-05, "epoch": 0.169424105367061, "step": 11500}
{"loss": 6.775315885543823, "learning_rate": 4.116048145910986e-05, "epoch": 0.1767903708178028, "step": 12000}
{"loss": 6.678355553627014, "learning_rate": 4.0792168186572774e-05, "epoch": 0.18415663626854456, "step": 12500}
{"loss": 6.6327721853256225, "learning_rate": 4.0423854914035685e-05, "epoch": 0.19152290171928635, "step": 13000}
{"loss": 6.632081868171692, "learning_rate": 4.0055541641498595e-05, "epoch": 0.19888916717002814, "step": 13500}
{"loss": 6.549484587669372, "learning_rate": 3.9687228368961506e-05, "epoch": 0.20625543262076992, "step": 14000}
{"loss": 6.516517824172974, "learning_rate": 3.9318915096424416e-05, "epoch": 0.2136216980715

{"loss": 5.294890280723572, "learning_rate": 1.5378552381513622e-05, "epoch": 0.6924289523697276, "step": 47000}
{"loss": 5.281967339992523, "learning_rate": 1.5010239108976532e-05, "epoch": 0.6997952178204694, "step": 47500}
{"loss": 5.313121272563935, "learning_rate": 1.4641925836439443e-05, "epoch": 0.7071614832712112, "step": 48000}
{"loss": 5.29164058971405, "learning_rate": 1.4273612563902353e-05, "epoch": 0.7145277487219529, "step": 48500}
{"loss": 5.276333406448364, "learning_rate": 1.3905299291365264e-05, "epoch": 0.7218940141726947, "step": 49000}
{"loss": 5.26769612455368, "learning_rate": 1.3536986018828175e-05, "epoch": 0.7292602796234365, "step": 49500}
{"loss": 5.288669554233551, "learning_rate": 1.3168672746291085e-05, "epoch": 0.7366265450741782, "step": 50000}
{"loss": 5.279936078071594, "learning_rate": 1.2800359473753998e-05, "epoch": 0.7439928105249201, "step": 50500}
{"loss": 5.251193338871002, "learning_rate": 1.2432046201216906e-05, "epoch": 0.7513590759756619, 
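The `learning_rate` values in this log follow a straight line from the initial rate down to zero at the last step. The sketch below assumes an initial rate of 5e-5, no warmup and 67,877 total steps; these are my assumptions, chosen because they reproduce the logged values, and the function is an illustration rather than the library's scheduler.

```python
def linear_decay_lr(step, total_steps, base_lr=5e-5, warmup_steps=0):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

TOTAL_STEPS = 67877  # iterations in one epoch for this run

# Compare against the logged values at steps 500 and 50000.
for step in (500, 50000):
    print(step, linear_decay_lr(step, TOTAL_STEPS), step / TOTAL_STEPS)
```

The second printed value is the learning rate and the third is the fraction of the epoch completed, which corresponds to the `epoch` field in the log.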

# **Check your model's prediction**
Load your pretrained model from the saved model directory and make a pipeline for the masked word prediction task.


In [6]:
from transformers import pipeline

model = BertForMaskedLM.from_pretrained('/kaggle/working/')

fill_mask = pipeline(
    "fill-mask",
    model=model,
    tokenizer=tokenizer
)
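Conceptually, a fill-mask step takes the model's scores (logits) at the `[MASK]` position, turns them into probabilities with a softmax over the whole vocabulary, and returns the highest-scoring tokens (five in the outputs shown in this notebook). The sketch below shows that ranking on made-up logits and a made-up five-word vocabulary; it is not the pipeline's own code.

```python
import math

def top_k_predictions(logits, id_to_token, k=5):
    """Softmax the logits and return the k most probable tokens."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [{"token": i, "token_str": id_to_token[i], "score": probs[i]}
            for i in ranked]

# Hypothetical vocabulary and logits for a single [MASK] position.
toy_vocab = ["the", "hospital", "city", "college", "medical"]
toy_logits = [0.1, 3.2, 3.4, 0.5, 1.0]

for prediction in top_k_predictions(toy_logits, toy_vocab, k=3):
    print(prediction)
```

The scores over the full vocabulary sum to 1, so a low top score means the probability mass is spread over many candidates.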

In [19]:
fill_mask('লাশ উদ্ধার করে ময়নাতদন্তের জন্য কক্সবাজার [MASK] মর্গে পাঠিয়েছে পুলিশ')

[{'sequence': '[CLS] লাশ উদধার করে মযনাতদনতের জনয ককসবাজার সদর মরগে পাঠিযেছে পলিশ [SEP]',
  'score': 0.41483181715011597,
  'token': 1767},
 {'sequence': '[CLS] লাশ উদধার করে মযনাতদনতের জনয ককসবাজার হাসপাতাল মরগে পাঠিযেছে পলিশ [SEP]',
  'score': 0.35426250100135803,
  'token': 3517},
 {'sequence': '[CLS] লাশ উদধার করে মযনাতদনতের জনয ককসবাজার হাসপাতালের মরগে পাঠিযেছে পলিশ [SEP]',
  'score': 0.046040359884500504,
  'token': 4152},
 {'sequence': '[CLS] লাশ উদধার করে মযনাতদনতের জনয ককসবাজার মেডিকেল মরগে পাঠিযেছে পলিশ [SEP]',
  'score': 0.044507820159196854,
  'token': 2268},
 {'sequence': '[CLS] লাশ উদধার করে মযনাতদনতের জনয ককসবাজার কলেজ মরগে পাঠিযেছে পলিশ [SEP]',
  'score': 0.008816763758659363,
  'token': 1253}]

I will translate the sentence and the predictions for you to understand the results.

`fill_mask('The body was recovered and sent to Cox's Bazar [MASK] morgue for autopsy')`
The answer should be ***hospital***.

The model's predictions were (sorted by probability score):
`city, hospital, hospital's, medical, college `

In [25]:
fill_mask('১৯৭১ সালে বাংলাদেশ ৯ মাস মুক্তিযুদ্ধ করে [MASK] অর্জন করে')

[{'sequence': '[CLS] ১৯৭১ সালে বাংলাদেশ ৯ মাস মকতিযদধ করে পরসকার অরজন করে [SEP]',
  'score': 0.03773483261466026,
  'token': 2225},
 {'sequence': '[CLS] ১৯৭১ সালে বাংলাদেশ ৯ মাস মকতিযদধ করে সবাধীনতা অরজন করে [SEP]',
  'score': 0.029507821425795555,
  'token': 2825},
 {'sequence': '[CLS] ১৯৭১ সালে বাংলাদেশ ৯ মাস মকতিযদধ করে তা অরজন করে [SEP]',
  'score': 0.028433645144104958,
  'token': 622},
 {'sequence': '[CLS] ১৯৭১ সালে বাংলাদেশ ৯ মাস মকতিযদধ করে দেশ অরজন করে [SEP]',
  'score': 0.024446936324238777,
  'token': 656},
 {'sequence': '[CLS] ১৯৭১ সালে বাংলাদেশ ৯ মাস মকতিযদধ করে মকতিযদধ অরজন করে [SEP]',
  'score': 0.013458509929478168,
  'token': 2454}]

I will translate the sentence and the predictions for you to understand the results.

`fill_mask('In 1971, Bangladesh gained [MASK] by fighting for 9 months')`
The answer should be ***independence***.

The model's predictions were (sorted by probability score):
`prize, independence, that, country, war `

# **Conclusion**
My model did a fairly decent job! As I said earlier, BERT needs a ton of text to understand a language better. Google's BERT-base was trained on BooksCorpus and English Wikipedia, roughly 16 GB of raw text containing around 3.3B words (around 110x of what we trained on).

I trained my model on random newspaper articles. It's better to train your BERT model on domain-specific text for your task; this generally gives better results in that domain.


So, Congratulations! 
You can now train your own BERT model in any language.

A question might now be arising in your mind.

***Can I train a model by using the weights from a pretrained model?***

Yes, you can. Notice that in the model definition section I defined the model this way:

`model = BertForMaskedLM(config)`

Here the `BertConfig` is passed as the argument. Instead, load the pretrained weights with `model = BertForMaskedLM.from_pretrained('bert-base-cased')`

or, if you want to load the model from a local directory,
`model = BertForMaskedLM.from_pretrained('your_model_directory')`

In that case, use the tokenizer (vocabulary) that belongs to the pretrained checkpoint, because the model's embedding matrix is tied to that vocabulary.

Thank you!

