<a href="https://colab.research.google.com/github/sahug/ds-bert/blob/main/BERT%20NLP%20-%20Masked%20Language%20Modeling%20using%20BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**BERT NLP - Causal Language Modeling using BERT**

**Language modeling** predicts words (tokens) in a sentence. There are two forms of language modeling:
- **Causal Language Modeling** predicts the next token in a sequence, and the model can attend only to the tokens on its left. Example checkpoint: `distilgpt2`.
- **Masked Language Modeling** predicts a masked token in a sequence, and the model can attend to tokens on both sides. Example checkpoint: `distilroberta-base`.
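
The difference between the two attention patterns can be pictured as a mask over token positions. The sketch below is only an illustration in plain Python (the notebook itself never builds masks by hand, and `seq_len` is a made-up value): a 1 at row `i`, column `j` means token `i` may attend to token `j`.

```python
seq_len = 4  # hypothetical sequence length, chosen for illustration

# Causal LM: token i may attend only to positions j <= i (lower triangle).
causal_mask = [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

# Masked LM: every token may attend to every position.
bidirectional_mask = [[1] * seq_len for _ in range(seq_len)]

for row in causal_mask:
    print(row)
for row in bidirectional_mask:
    print(row)
```

The first row of the causal mask allows only the token itself, while the last row allows the whole sequence.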






In [52]:
%pip install -qq datasets

**Load Dataset**

In [53]:
from datasets import load_dataset
eli5 = load_dataset("eli5", split="train_asks[:5000]")

Reusing dataset eli5 (/root/.cache/huggingface/datasets/eli5/LFQA_reddit/1.0.0/17574e5502a10f41bbd17beba83e22475b499fa62caa1384a3d093fc856fe6fa)


**Train and Test Split**

In [54]:
eli5 = eli5.train_test_split(test_size=0.2)

In [55]:
eli5

DatasetDict({
    train: Dataset({
        features: ['q_id', 'title', 'selftext', 'document', 'subreddit', 'answers', 'title_urls', 'selftext_urls', 'answers_urls'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['q_id', 'title', 'selftext', 'document', 'subreddit', 'answers', 'title_urls', 'selftext_urls', 'answers_urls'],
        num_rows: 1000
    })
})

In [56]:
eli5["train"][0], eli5["test"][0] 

({'answers': {'a_id': ['c8opnjk'],
   'score': [4],
   'text': ['Humans traded brute strength for fine motor control.\n\n_URL_0_']},
  'answers_urls': {'url': ['http://www.mentalindigestion.net/?p=266']},
  'document': '',
  'q_id': '19jcf0',
  'selftext': '',
  'selftext_urls': {'url': []},
  'subreddit': 'askscience',
  'title': 'How is it possible for orangoutangs of comparable weight to be up to five times stronger than their human counterpart? ',
  'title_urls': {'url': []}},
 {'answers': {'a_id': ['c2v8s37', 'c2v9n7c', 'c2v9440'],
   'score': [5, 4, 2],
   'text': ['Neither. You want to drive a wedge into the middle of the handle so it is pressing outward with tighter force on the metal axe head. \n\n[Here](_URL_0_) you can see an axe head that has been repaired a couple times. First with a wedge and then with some dowels.',
    "It is better to hit the handle on the ground because then the head can be driven by momentum down past the top of the handle. If you hit the head on the

**Extract Text**

**Flatten** the dataset so that nested fields become top-level columns. After flattening, the answer text is available as the column `answers.text` instead of through the nested lookup `["answers"]["text"]`.

In [57]:
eli5 = eli5.flatten()

In [58]:
eli5["train"]["answers.text"][0]

['Humans traded brute strength for fine motor control.\n\n_URL_0_']
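
To make the dotted column names concrete, here is a conceptual sketch of flattening a single nested record in plain Python. It is not how `datasets` implements `flatten()`; `flatten_record` is a hypothetical helper, and the record is an abbreviated copy of the first training example shown earlier.

```python
def flatten_record(record, parent_key=""):
    """Flatten nested dicts into one level, joining keys with a dot."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}.{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts and merge their flattened keys.
            flat.update(flatten_record(value, full_key))
        else:
            flat[full_key] = value
    return flat

# Abbreviated record in the shape of the ELI5 examples above.
record = {
    "q_id": "19jcf0",
    "answers": {"a_id": ["c8opnjk"], "score": [4]},
}

flat = flatten_record(record)
print(sorted(flat))  # ['answers.a_id', 'answers.score', 'q_id']
```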

**Preprocess**

In [59]:
%pip install -qq transformers

In [76]:
from transformers import AutoTokenizer
# Use the causal checkpoint distilgpt2 (a GPT-2 style tokenizer).
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

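# The tokenizer has no padding token. Reuse the end-of-sequence token so that
# padded positions map to an ID that already exists in the vocabulary.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token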

Using pad_token, but it is not set yet.


In [77]:
def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]], truncation=True)
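
Each row of `answers.text` is a list of answer strings, so the list comprehension inside `preprocess_function` joins them into one string per question before tokenizing. A small sketch with made-up strings (the `batch` dictionary is hypothetical and only mimics the column layout):

```python
# Toy batch in the layout of the flattened "answers.text" column:
# one list of answer strings per question (made-up strings).
batch = {
    "answers.text": [
        ["First answer.", "Second answer."],
        ["Only answer."],
    ]
}

# Same joining step as in preprocess_function, without the tokenizer call.
joined = [" ".join(answers) for answers in batch["answers.text"]]
print(joined)  # ['First answer. Second answer.', 'Only answer.']
```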

In [78]:
tokenized_eli5 = eli5.map(preprocess_function, batched=True, num_proc=4, remove_columns=eli5["train"].column_names)

        


In [79]:
tokenized_eli5

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 1000
    })
})

In [80]:
print(tokenized_eli5["train"]["input_ids"][0])

[32661, 504, 14018, 33908, 4202, 329, 3734, 5584, 1630, 13, 198, 198, 62, 21886, 62, 15, 62]


**Capture Truncated Text**

When we tokenize the texts with `truncation=True`, the tokenizer cuts any example that exceeds the model's maximum input length, and the remaining examples still vary widely in length. So we need a second preprocessing function that regroups the tokenized text into fixed-size blocks, which avoids losing further information when the training examples are built. This preprocessing function should:

- Concatenate all the tokenized sequences.
- Split the concatenated sequence into smaller chunks of `BLOCK_SIZE` tokens.

In [81]:
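BLOCK_SIZE = 128  # number of tokens per training chunk

def group_text(examples):
    # Concatenate every sequence in the batch into one long list per column.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Split each concatenated list into chunks of BLOCK_SIZE tokens;
    # the final chunk of a batch may be shorter than BLOCK_SIZE.
    result = {
        k: [t[i : i + BLOCK_SIZE] for i in range(0, total_length, BLOCK_SIZE)]
        for k, t in concatenated_examples.items()
    }
    # For causal LM the labels start as a copy of the inputs.
    result["labels"] = result["input_ids"].copy()
    return result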
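
The effect of concatenating and chunking is easier to see on a tiny batch. The sketch below repeats the same logic with a small, configurable block size; `chunk_batch` and the token IDs are made up for illustration and are not part of the notebook.

```python
def chunk_batch(examples, block_size):
    """Concatenate each column of the batch, then cut it into fixed-size chunks."""
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

# Three tokenized examples of different lengths (made-up token IDs).
toy = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1]],
}

chunks = chunk_batch(toy, block_size=4)
print(chunks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8], [9]]
```

Note that example boundaries disappear after concatenation and that the last chunk is shorter than the block size, so batches can still need padding later.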

In [82]:
lm_dataset = tokenized_eli5.map(group_text, batched=True, num_proc=4)

        


In [83]:
lm_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 8516
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2104
    })
})

In [84]:
print(lm_dataset["train"]["input_ids"][0])

[32661, 504, 14018, 33908, 4202, 329, 3734, 5584, 1630, 13, 198, 198, 62, 21886, 62, 15, 62, 2949, 13, 1002, 345, 12854, 597, 705, 2971, 26842, 6, 290, 7110, 340, 355, 257, 1627, 11, 340, 481, 1061, 262, 1627, 3446, 645, 2300, 543, 835, 340, 17781, 326, 1627, 11, 340, 338, 530, 286, 262, 16636, 7811, 286, 36237, 13, 198, 198, 2396, 326, 6209, 1724, 11, 1997, 326, 16701, 7689, 1657, 16701, 7689, 340, 2035, 4571, 13, 1026, 318, 1744, 475, 314, 716, 21254, 7114, 771, 13, 220, 220, 198, 198, 13300, 484, 547, 9759, 284, 1635, 34, 619, 5172, 14922, 377, 36639, 1874, 16241, 3780, 9, 29565, 9291, 287, 7672, 1741, 11, 4793, 16151, 62, 21886, 62, 15, 62, 4008, 543, 318, 262, 779, 286, 31543, 48308]


In [85]:
print(lm_dataset["train"]["labels"][0])

[32661, 504, 14018, 33908, 4202, 329, 3734, 5584, 1630, 13, 198, 198, 62, 21886, 62, 15, 62, 2949, 13, 1002, 345, 12854, 597, 705, 2971, 26842, 6, 290, 7110, 340, 355, 257, 1627, 11, 340, 481, 1061, 262, 1627, 3446, 645, 2300, 543, 835, 340, 17781, 326, 1627, 11, 340, 338, 530, 286, 262, 16636, 7811, 286, 36237, 13, 198, 198, 2396, 326, 6209, 1724, 11, 1997, 326, 16701, 7689, 1657, 16701, 7689, 340, 2035, 4571, 13, 1026, 318, 1744, 475, 314, 716, 21254, 7114, 771, 13, 220, 220, 198, 198, 13300, 484, 547, 9759, 284, 1635, 34, 619, 5172, 14922, 377, 36639, 1874, 16241, 3780, 9, 29565, 9291, 287, 7672, 1741, 11, 4793, 16151, 62, 21886, 62, 15, 62, 4008, 543, 318, 262, 779, 286, 31543, 48308]


For **Causal Language Modeling**, use `DataCollatorForLanguageModeling` with `mlm=False` to create batches of examples. The collator also dynamically pads the sequences to the length of the longest element in each batch, so that they have a uniform length. While it is possible to pad the text in the tokenizer function by setting `padding=True`, dynamic padding is more efficient.

In [86]:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="tf")
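
As a rough picture of what dynamic padding means, the sketch below pads a batch to the length of its longest sequence in plain Python. It is a simplified illustration, not the collator's actual code: `pad_batch` and `PAD_ID` are hypothetical, and `-100` is used as the label value that the loss conventionally ignores.

```python
PAD_ID = 0           # hypothetical pad token ID, for illustration only
IGNORE_INDEX = -100  # label value conventionally skipped by the loss

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one in this batch (not a global maximum)."""
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        batch["input_ids"].append(seq + [pad_id] * n_pad)
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        # Padded positions get IGNORE_INDEX so they do not contribute to the loss.
        batch["labels"].append(seq + [IGNORE_INDEX] * n_pad)
    return batch

batch = pad_batch([[5, 6, 7], [8]])
print(batch["input_ids"])       # [[5, 6, 7], [8, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
print(batch["labels"])          # [[5, 6, 7], [8, -100, -100]]
```

Because the padding length depends only on the current batch, short batches stay short, which is where the efficiency gain comes from.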

**Train**

To **fine-tune** a model in TensorFlow, start by converting the datasets to the `tf.data.Dataset` format with `to_tf_dataset`. Specify the input and label columns in `columns`, whether to shuffle the dataset order, the batch size, and the data collator:

In [87]:
tf_train_set = lm_dataset["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    dummy_labels=True,
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_test_set = lm_dataset["test"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    dummy_labels=True,
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

**Optimizer**

In [88]:
from transformers import AdamWeightDecay
optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

**Model**

In [89]:
from transformers import TFAutoModelForCausalLM
model = TFAutoModelForCausalLM.from_pretrained("distilgpt2")

Downloading:   0%|          | 0.00/313M [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at distilgpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


**Compile**

In [90]:
import tensorflow as tf
model.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour, please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.
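
The internal loss mentioned in the message above is, for causal language modeling, a next-token cross-entropy: the prediction at position `t` is scored against the token at position `t + 1`. The following is a simplified plain-Python sketch of that general technique under that assumption, not the Transformers source code; `causal_lm_loss` is a hypothetical helper and the logits are made up.

```python
import math

def causal_lm_loss(logits, labels, ignore_index=-100):
    """Mean next-token cross-entropy: logits[t] is scored against labels[t + 1]."""
    total, count = 0.0, 0
    for t in range(len(labels) - 1):
        target = labels[t + 1]
        if target == ignore_index:
            continue  # skip positions marked as padding
        row = logits[t]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(row)
        log_norm = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_norm - row[target]
        count += 1
    return total / count if count else 0.0

# Uniform logits over a 4-token vocabulary: every token is equally likely,
# so the loss equals log(4).
logits = [[0.0] * 4 for _ in range(3)]
labels = [2, 1, 3]
loss = causal_lm_loss(logits, labels)
print(loss)
```

A model that guesses uniformly over the vocabulary scores the logarithm of the vocabulary size, which gives a useful reference point when reading training losses.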


**Fit**

In [None]:
model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3)

Epoch 1/3
 16/532 [..............................] - ETA: 3:29:38 - loss: 4.1331
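
The numbers in this log can be checked against the values shown earlier. The sketch below assumes that the step count corresponds to full batches only (an inference from the log, not something the notebook states) and converts the reported loss to perplexity, assuming the loss is a mean cross-entropy in nats.

```python
import math

train_rows = 8516  # num_rows of lm_dataset["train"] shown earlier
batch_size = 16    # batch_size passed to to_tf_dataset

full_batches = train_rows // batch_size  # batches with exactly 16 examples
leftover = train_rows % batch_size       # examples that do not fill a batch
print(full_batches, leftover)            # 532 4

# Perplexity is the exponential of the mean cross-entropy loss.
reported_loss = 4.1331  # loss value from the log line above
perplexity = math.exp(reported_loss)
print(round(perplexity, 1))
```

The 532 steps per epoch match the number of full batches, which suggests that the incomplete final batch of 4 examples is dropped for the shuffled training set.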