**FINE-TUNING BERT MODELS FOR MASKED LANGUAGE MODELING**

**Code reference**
* masked language modeling script: https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py
* custom token masking example: https://www.analyticsvidhya.com/blog/2022/09/fine-tuning-bert-with-masked-language-modeling/

**Import libraries**

In [1]:
import pandas as pd
import numpy as np

import torch
from torch import cuda

from datasets import Dataset

from sklearn.model_selection import train_test_split

from transformers import AutoConfig
from transformers import AutoTokenizer
from transformers import AutoModelForMaskedLM
from transformers import DataCollatorForLanguageModeling
from transformers import DataCollatorForWholeWordMask
from transformers import TrainingArguments
from transformers import Trainer
from transformers import set_seed

import evaluate

from src.models import save_model_and_tokenizer

import warnings
import gc



**Set library parameters**

In [2]:
warnings.filterwarnings(action="ignore")
set_seed(seed=42)

**Check GPU info if available**

In [3]:
!nvidia-smi

Mon Feb 20 16:49:37 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:65:00.0 Off |                  N/A |
| 38%   61C    P2    57W / 250W |    991MiB / 11175MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

**Clear the GPU memory cache if a GPU is available and set the torch device**

In [4]:
if cuda.is_available():
    gc.collect()
    cuda.empty_cache()
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
device

device(type='cuda')

**Define useful parameters**

In [5]:
#INPUT_FILEPATH = "data/processed/airbnb_london_20220910.parquet"
INPUT_FILEPATH = "data/processed/cso_v3.3.parquet"
# Strip only the final extension so that names containing dots (e.g. "cso_v3.3") stay intact.
INPUT_FILENAME = INPUT_FILEPATH.split("/")[-1].rsplit(".", 1)[0]

PRETRAINED_MODEL_NAME_OR_PATH = "bert-base-uncased"
EPOCHS = 20
BATCH_SIZE = 16
METRIC = evaluate.load(path="accuracy")

MODEL_OUTPUT_DIR = f"models/{PRETRAINED_MODEL_NAME_OR_PATH}-{INPUT_FILENAME}-mask/"
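The dataset name contains a dot (`cso_v3.3`), so splitting the file name on its first dot would truncate it to `cso_v3`. A minimal sketch of one way to avoid this, assuming only the final extension should be removed, uses `pathlib.Path.stem` from the standard library. The helper name `input_filename` is hypothetical and not part of the notebook.

```python
from pathlib import Path


def input_filename(filepath: str) -> str:
    """Return the file name without its final extension (hypothetical helper)."""
    # Path.stem removes only the last suffix, so dots inside the name survive.
    return Path(filepath).stem


print(input_filename("data/processed/cso_v3.3.parquet"))
print(input_filename("data/processed/airbnb_london_20220910.parquet"))
```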

**Load input file**

In [6]:
df = pd.read_parquet(path=INPUT_FILEPATH)
df.sample(n=5)

| | text |
|---|---|
| 22090 | USB phone is related to Computer peripherals |
| 11050 | Irish Terrier is related to Breed standard (do... |
| 11743 | Kentucky head city Frankfort, Kentucky |
| 1483 | Badminton is related to Canada |
| 4707 | City Pulse is called City Pulse |


**Split dataframe into training and validation sets**

In [7]:
df_train, df_valid = train_test_split(df, test_size=0.05, random_state=42)
print(f"training set size: {df_train.shape[0]}, validation set size: {df_valid.shape[0]}")

training set size: 24075, validation set size: 1268
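The printed sizes can be checked by hand. The sketch below assumes that, for a fractional `test_size`, scikit-learn rounds the held-out share up and assigns the remaining rows to training. The function name `split_sizes` is hypothetical, and the total row count is inferred from the two printed numbers.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple:
    """Return (n_train, n_valid) for a fractional test_size."""
    n_valid = math.ceil(test_size * n_samples)  # held-out share is rounded up
    n_train = n_samples - n_valid               # the rest goes to training
    return n_train, n_valid


n_total = 24075 + 1268  # total rows implied by the printed sizes
print(split_sizes(n_total, 0.05))
```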


**Convert pandas dataframes into Hugging Face datasets**

In [8]:
train_dataset = Dataset.from_pandas(df=df_train)
valid_dataset = Dataset.from_pandas(df=df_valid)

**Load model configuration**

In [9]:
config_kwargs = {
    "revision": "main",
}

config = AutoConfig.from_pretrained(
    pretrained_model_name_or_path=PRETRAINED_MODEL_NAME_OR_PATH,
    **config_kwargs
)
config

BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.20.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

**Load tokenizer**

In [10]:
tokenizer_kwargs = {
    "use_fast": True,
    "revision": "main"
}

tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path=PRETRAINED_MODEL_NAME_OR_PATH,
    **tokenizer_kwargs
)

**Load model**

In [11]:
model = AutoModelForMaskedLM.from_pretrained(
    pretrained_model_name_or_path=PRETRAINED_MODEL_NAME_OR_PATH,
    config=config,
    revision="main"
)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


**Define function to tokenize texts**

In [12]:
def tokenize_function(example: dict, text_key: str = "text") -> dict:
    """Function to tokenize text examples.
    
    Parameters
    ----------
    example : dict
        Dict with the texts to tokenize.
    text_key : str, default='text'
        Key name with text values.
    
    Returns
    -------
    dict
        Input texts tokenized.
    
    """
    return tokenizer(
        text=example[text_key],
        return_special_tokens_mask=True
    )

**Example of tokenization on text**

In [13]:
example = {"text": ["Try to tokenize this!"]}
encoded_text = tokenize_function(example)
encoded_text

{'input_ids': [[101, 3046, 2000, 19204, 4697, 2023, 999, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1]], 'special_tokens_mask': [[1, 0, 0, 0, 0, 0, 0, 1]]}

In [14]:
tokenizer.convert_ids_to_tokens(ids=encoded_text["input_ids"][0])

['[CLS]', 'try', 'to', 'token', '##ize', 'this', '!', '[SEP]']
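The split of "tokenize" into `token` and `##ize` comes from WordPiece, which matches the longest vocabulary entry from the left and marks continuation pieces with `##`. The following is a simplified sketch of that greedy matching on a toy vocabulary, not the tokenizer's actual implementation. `wordpiece` and `toy_vocab` are hypothetical names, and the real tokenizer also handles lowercasing, punctuation splitting and a 30522-entry vocabulary.

```python
def wordpiece(word: str, vocab: set, unk: str = "[UNK]") -> list:
    """Greedy longest-match-first split of one lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matched: the whole word maps to the unknown token
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {"try", "to", "token", "##ize", "this", "!"}  # hypothetical toy vocabulary
words = ["try", "to", "tokenize", "this", "!"]
tokens = ["[CLS]"] + [p for w in words for p in wordpiece(w, toy_vocab)] + ["[SEP]"]
# 1 marks special tokens, which must never be selected for masking.
special_tokens_mask = [int(t in {"[CLS]", "[SEP]"}) for t in tokens]
print(tokens)
print(special_tokens_mask)
```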

**Apply tokenization to the whole dataset**

In [15]:
tokenized_train_dataset = train_dataset.map(function=tokenize_function, batched=True, remove_columns=["text"])
tokenized_valid_dataset = valid_dataset.map(function=tokenize_function, batched=True, remove_columns=["text"])



  0%|          | 0/25 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

**Define function to group input texts**

In [16]:
def group_texts(examples: dict, max_seq_length: int = 128) -> dict:
    """Group input texts into chunks.
    
    Parameters
    ----------
    examples : dict
        Dictionary with text examples to group.
    max_seq_length : int, default=128
        Max sequence length for grouping.
        
    Returns
    -------
    dict
        Input texts grouped.
        
    """
    # Concatenate all texts.
    concatenated_examples = {k: np.hstack(list(examples[k])) for k in examples.keys()}
    
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    
    # Drop the trailing remainder so that every chunk has exactly max_seq_length tokens.
    if total_length >= max_seq_length:
        total_length = (total_length // max_seq_length) * max_seq_length
    # Split into chunks of max_seq_length.
    result = {
        k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
        for k, t in concatenated_examples.items()
    }
    return result
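To make the grouping concrete, here is a small pure-Python sketch of the same concatenate-then-chunk idea applied to a single column. The token ids are made up and `group_into_chunks` is a hypothetical name. Because the notebook's function is applied with `batched=True`, it runs once per batch, so a remainder can be dropped in every batch.

```python
from itertools import chain


def group_into_chunks(sequences: list, chunk_size: int) -> list:
    """Concatenate token-id lists and cut them into chunks of chunk_size."""
    flat = list(chain.from_iterable(sequences))
    total = len(flat)
    # Keep only the largest multiple of chunk_size; the remainder is dropped.
    if total >= chunk_size:
        total = (total // chunk_size) * chunk_size
    return [flat[i : i + chunk_size] for i in range(0, total, chunk_size)]


toy_input_ids = [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 4, 102]]  # made-up ids
chunks = group_into_chunks(toy_input_ids, chunk_size=5)
print(chunks)  # 12 tokens in, two chunks of 5 out, the last 2 tokens are discarded
```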

**Apply group operation on dataset**

In [17]:
lm_train_dataset = tokenized_train_dataset.map(function=group_texts, batched=True, batch_size=128)
lm_test_dataset = tokenized_valid_dataset.map(function=group_texts, batched=True, batch_size=128)

  0%|          | 0/189 [00:00<?, ?ba/s]

  0%|          | 0/10 [00:00<?, ?ba/s]

**Define functions to compute metrics**

In [18]:
metric = evaluate.load(path="accuracy")


def preprocess_logits_for_metrics(logits, labels):
    """Preprocessing model logits for classification metrics."""
    if isinstance(logits, tuple):
        logits = logits[0]
    return logits.argmax(dim=-1)


def compute_metrics(eval_preds):
    """Compute classification metrics."""
    preds, labels = eval_preds
    labels = labels.reshape(-1)
    preds = preds.reshape(-1)
    mask = labels != -100
    labels = labels[mask]
    preds = preds[mask]
    return metric.compute(predictions=preds, references=labels)
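The two functions above reduce the logits to predicted token ids and then score only the positions whose label is not `-100`, which are the masked ones. A minimal pure-Python sketch of that computation on toy values follows. `masked_accuracy` and the toy data are hypothetical, and the notebook itself delegates the final score to the `evaluate` accuracy metric.

```python
IGNORE_INDEX = -100  # label value that the metric code filters out


def argmax(row: list) -> int:
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def masked_accuracy(logits: list, labels: list) -> float:
    """Accuracy over positions whose label is not IGNORE_INDEX."""
    preds = [argmax(scores) for scores in logits]
    kept = [(p, y) for p, y in zip(preds, labels) if y != IGNORE_INDEX]
    return sum(p == y for p, y in kept) / len(kept)


# Four positions with a three-token toy vocabulary; only positions 1 and 2 are scored.
toy_logits = [[0.1, 2.0, 0.3], [1.5, 0.2, 0.1], [0.0, 0.1, 3.0], [0.9, 0.8, 0.7]]
toy_labels = [-100, 0, 1, -100]
print(masked_accuracy(toy_logits, toy_labels))
```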

**Define training arguments**

In [19]:
training_args = TrainingArguments(
    output_dir=MODEL_OUTPUT_DIR,
    evaluation_strategy="epoch",
    per_device_train_batch_size=BATCH_SIZE,
    learning_rate=5e-05,
    weight_decay=0.01,
    num_train_epochs=EPOCHS,
    save_strategy="epoch",
    seed=42,
    load_best_model_at_end=True
)

**Define data collator for masked language modeling**  
reference: https://github.com/huggingface/transformers/blob/c836f77266be9ace47bff472f63caf71c0d11333/src/transformers/data/data_collator.py#L609

In [20]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, 
    mlm=True, 
    mlm_probability=0.15
)
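As a rough illustration of what the collator does to each sequence, the sketch below selects non-special positions with probability `mlm_probability` and then applies the 80/10/10 replacement scheme from the original BERT recipe. It is a simplified, hypothetical re-implementation in plain Python and not the library code. The `[MASK]` id 103 is assumed for `bert-base-uncased`, and the example raises the probability to 0.5 only so that a short sequence is likely to show some masking.

```python
import random

IGNORE_INDEX = -100


def mask_tokens(input_ids, special_tokens_mask, mask_id, vocab_size,
                mlm_probability=0.15, rng=None):
    """Return (masked_inputs, labels) for one sequence of token ids."""
    rng = rng or random.Random()
    inputs, labels = list(input_ids), []
    for i, (tok, is_special) in enumerate(zip(input_ids, special_tokens_mask)):
        if is_special or rng.random() >= mlm_probability:
            labels.append(IGNORE_INDEX)  # position not selected: no loss here
            continue
        labels.append(tok)  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels


ids = [101, 3046, 2000, 19204, 4697, 2023, 999, 102]  # "Try to tokenize this!"
special = [1, 0, 0, 0, 0, 0, 0, 1]
masked, labels = mask_tokens(ids, special, mask_id=103, vocab_size=30522,
                             mlm_probability=0.5, rng=random.Random(0))
print(masked)
print(labels)
```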

**Define trainer object for model training**

In [21]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_train_dataset,
    eval_dataset=lm_test_dataset,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics
)
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForMaskedLM.forward` and have been ignored: __index_level_0__, special_tokens_mask. If __index_level_0__, special_tokens_mask are not expected by `BertForMaskedLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 189
  Num Epochs = 20
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 240


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 2.884815 | 0.585987 |
| 2 | No log | 2.445691 | 0.625954 |
| 3 | No log | 2.152229 | 0.651316 |
| 4 | No log | 1.977283 | 0.7 |
| 5 | No log | 3.203666 | 0.595588 |
| 6 | No log | 2.190649 | 0.656442 |
| 7 | No log | 2.145309 | 0.628205 |
| 8 | No log | 2.07512 | 0.681818 |
| 9 | No log | 2.415602 | 0.646667 |
| 10 | No log | 2.063743 | 0.634615 |


The following columns in the evaluation set don't have a corresponding argument in `BertForMaskedLM.forward` and have been ignored: __index_level_0__, special_tokens_mask. If __index_level_0__, special_tokens_mask are not expected by `BertForMaskedLM.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 10
  Batch size = 8
Saving model checkpoint to models/bert-base-uncased-airbnb_london_20220910-mask/checkpoint-12
Configuration saved in models/bert-base-uncased-airbnb_london_20220910-mask/checkpoint-12/config.json
Model weights saved in models/bert-base-uncased-airbnb_london_20220910-mask/checkpoint-12/pytorch_model.bin

TrainOutput(global_step=240, training_loss=2.0653133392333984, metrics={'train_runtime': 237.6427, 'train_samples_per_second': 15.906, 'train_steps_per_second': 1.01, 'total_flos': 248728548096000.0, 'train_loss': 2.0653133392333984, 'epoch': 20.0})
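The numbers in the training log are consistent with each other, which a few lines of arithmetic can confirm. The sketch assumes a single device and no gradient accumulation, as the log reports, and takes every value from the output above.

```python
import math

num_examples = 189        # "Num examples" in the training log
batch_size = 16           # per-device batch size, one device, no accumulation
epochs = 20
train_runtime = 237.6427  # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_examples / batch_size)  # the last batch is partial
total_steps = steps_per_epoch * epochs
samples_per_second = num_examples * epochs / train_runtime

print(steps_per_epoch, total_steps, round(samples_per_second, 3))
```

The 12 steps per epoch also explain why the first checkpoint directory is named `checkpoint-12`.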

**Compute metrics on validation set**

In [22]:
metrics = trainer.evaluate()

df_test_report = pd.DataFrame(data=[metrics]) \
                   .transpose() \
                   .reset_index() \
                   .rename(columns={"index": "metric", 0: "value"})
df_test_report

The following columns in the evaluation set don't have a corresponding argument in `BertForMaskedLM.forward` and have been ignored: __index_level_0__, special_tokens_mask. If __index_level_0__, special_tokens_mask are not expected by `BertForMaskedLM.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 10
  Batch size = 8


| | metric | value |
|---|---|---|
| 0 | eval_loss | 1.751728 |
| 1 | eval_accuracy | 0.737226 |
| 2 | eval_runtime | 0.086 |
| 3 | eval_samples_per_second | 116.267 |
| 4 | eval_steps_per_second | 23.253 |
| 5 | epoch | 20.0 |
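Masked language models are often also reported in terms of perplexity, which is the exponential of the mean cross-entropy loss; the `run_mlm.py` script referenced at the top derives it from `eval_loss` in this way. A minimal sketch using the loss reported above:

```python
import math

eval_loss = 1.751728              # eval_loss from the report above
perplexity = math.exp(eval_loss)  # perplexity = exp(mean cross-entropy)
print(f"perplexity: {perplexity:.2f}")
```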


**Serialize fine-tuned model**

In [23]:
save_model_and_tokenizer(model=model, tokenizer=tokenizer, save_directory=MODEL_OUTPUT_DIR)

Configuration saved in models/bert-base-uncased-airbnb_london_20220910-mask/config.json
Model weights saved in models/bert-base-uncased-airbnb_london_20220910-mask/pytorch_model.bin
tokenizer config file saved in models/bert-base-uncased-airbnb_london_20220910-mask/tokenizer_config.json
Special tokens file saved in models/bert-base-uncased-airbnb_london_20220910-mask/special_tokens_map.json
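The helper `save_model_and_tokenizer` is imported from the project's own `src.models` module and its source is not shown here. Judging only from the log lines above, it plausibly wraps the `save_pretrained` methods of the model and the tokenizer. The following is a hypothetical stand-in under that assumption, not the project's actual code.

```python
from pathlib import Path


def save_model_and_tokenizer_sketch(model, tokenizer, save_directory: str) -> None:
    """Hypothetical stand-in for the project's save_model_and_tokenizer helper."""
    # Create the target directory (and any missing parents) before saving.
    Path(save_directory).mkdir(parents=True, exist_ok=True)
    model.save_pretrained(save_directory)      # model config and weights
    tokenizer.save_pretrained(save_directory)  # tokenizer config and vocabulary files
```

A model saved this way can later be reloaded by passing the same directory to `AutoModelForMaskedLM.from_pretrained` and `AutoTokenizer.from_pretrained`.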
