# __BERT retrain__

Modified from [the Python Code](https://thepythoncode.com/article/pretraining-bert-huggingface-transformers-in-python)
- [Google Colab](https://colab.research.google.com/drive/1An1VNpKKMRVrwcdQQNSe7Omh_fl2Gj-2?usp=sharing&pli=1#scrollTo=-CVoZ3bC_j6K)

## ___Setup___

### Install

```bash
conda create -n bert python
conda activate bert
```

In [1]:
#%pip install datasets transformers==4.11.2 sentencepiece
%pip install datasets transformers sentencepiece
%pip install ipywidgets
%pip install pytest
%pip install xformers
%pip install session_info
%pip install accelerate -U

Note: you may need to restart the kernel to use updated packages.


### Import

In [1]:
import os, json, session_info
import pandas as pd
from pathlib import Path
from datasets import Dataset
from transformers import BertTokenizerFast, BertConfig, BertForMaskedLM, \
                         DataCollatorForLanguageModeling, TrainingArguments, \
                         Trainer, pipeline
from tokenizers import BertWordPieceTokenizer

### Folders and rand_seed

In [2]:
rand_seed = 20231002

work_dir = Path.home() / "projects/plantbert"
work_dir.mkdir(parents=True, exist_ok=True)

corpus_file = work_dir / "corpus_with_topics.tsv.gz"

model_dir = work_dir / "models"
model_dir.mkdir(parents=True, exist_ok=True)

os.chdir(work_dir)
print("cwd:", os.getcwd())

cwd: /mnt/d/projects/plantbert


In [3]:
# write_req_file only writes the imported modules to the requirements file
session_info.show(na=True, jupyter=True, dependencies=True, html=False,
                  write_req_file=True)

-----
datasets            2.14.5
pandas              2.1.1
session_info        1.0.0
tokenizers          0.13.3
transformers        4.33.3
-----
accelerate          0.23.0
aiohttp             3.8.5
aiosignal           1.3.1
asttokens           NA
async_timeout       4.0.3
attr                23.1.0
backcall            0.2.0
certifi             2023.07.22
charset_normalizer  3.3.0
comm                0.1.4
cython_runtime      NA
dateutil            2.8.2
debugpy             1.6.7
decorator           5.1.1
dill                0.3.7
executing           1.2.0
filelock            3.12.4
frozenlist          1.4.0
fsspec              2023.6.0
huggingface_hub     0.17.3
idna                3.4
ipykernel           6.25.2
ipywidgets          8.1.1
jedi                0.19.0
mpmath              1.3.0
multidict           6.0.4
multiprocess        0.70.15
numpy               1.26.0
nvfuser             NA
packaging           23.2
parso               0.8.3
pexpect             4.8.0
pickleshare       

## ___Process raw data___

Info on dataset creation
- [Hugging face dataset creation](https://huggingface.co/docs/datasets/create_dataset)
- [Huggingface NLP course](https://huggingface.co/learn/nlp-course/chapter5/5)
- [Pd dataframe to dataset](https://discuss.huggingface.co/t/from-pandas-dataframe-to-huggingface-dataset/9322)

Corpus: table4_2_corpus_with_topic_assignment.tsv
- This corpus is from the plant science history project and consists of 421658 PubMed entries classified as plant science papers.

### Convert corpus to dataset

Steps
- gz to Pandas DataFrame
- dataframe to dataset

In [4]:
# Read gz as a pandas dataframe
corpus = pd.read_csv(corpus_file, sep="\t", compression="gzip")
corpus.head(2)

| Unnamed: 0.1 | Unnamed: 0 | Index_1385417 | PMID | Date | Journal | Title | Abstract | Initial filter qualifier | Corpus | reg_article | Text classification score | Preprocessed corpus | Topic |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 61 | 1975-12-11 | Biochimica et biophysica acta | Identification of the 120 mus phase in the dec... | After a 500 mus laser flash a 120 mus phase in... | spinach | Identification of the 120 mus phase in the dec... | 1 | 0.716394 | identification 120 mus phase decay delayed flu... | 52 |
| 1 | 1 | 4 | 67 | 1975-11-20 | Biochimica et biophysica acta | Cholinesterases from plant tissues. VI. Prelim... | Enzymes capable of hydrolyzing esters of thioc... | plant | Cholinesterases from plant tissues. VI. Prelim... | 1 | 0.894874 | cholinesterases plant tissues . vi . prelimina... | 48 |


In [7]:
# Keep only the text (Corpus) and topic (Topic) columns, renamed to text/label
corpus_txt = corpus[["Corpus", "Topic"]]
corpus_txt.columns = ["text", "label"]
corpus_txt.shape

(421307, 2)

In [8]:
# Create a dataset from the pandas dataframe
dataset = Dataset.from_pandas(corpus_txt)
dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 421307
})

### Split train/test

In [11]:
# split the dataset into training (90%) and testing (10%)
d = dataset.train_test_split(test_size=0.1, seed=rand_seed)

d["train"], d["test"]

(Dataset({
     features: ['text', 'label'],
     num_rows: 379176
 }),
 Dataset({
     features: ['text', 'label'],
     num_rows: 42131
 }))
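
The split sizes above can be reproduced with a little arithmetic. This is a minimal sketch, assuming the test count is the fraction rounded up and the remainder goes to training; `split_sizes` is a hypothetical helper, not part of `datasets`.

```python
import math


def split_sizes(n_rows: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test size.

    Assumption: the test count is rounded up and the rest goes to train.
    """
    n_test = math.ceil(test_size * n_rows)
    return n_rows - n_test, n_test


# 421307 rows is the shape of corpus_txt reported above
n_train, n_test = split_sizes(421307, 0.1)
print(n_train, n_test)
```

Under that assumption the helper gives the same row counts as the `train_test_split` output shown above.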

In [12]:
for t in d["train"]["text"][:3]:
  print(t[:50],"...")

Chlorophyll degradation during senescence. The cat ...
vsiRNAs derived from the miRNA-generating sites of ...
Postharvest physiology and technology  of loquat ( ...


### Dataset to text for training tokenizer

In [None]:
def dataset_to_text(dataset, output_filename="data.txt"):
  """Utility function to save the dataset text to disk, one example per line.
  Useful for training the tokenizer, which accepts files as input."""
  # explicit UTF-8 so non-ASCII characters are written the same way on every platform
  with open(output_filename, "w", encoding="utf-8") as f:
    for t in dataset["text"]:
      print(t, file=f)

# save the training set to train.txt
dataset_to_text(d["train"], "train.txt")

# save the testing set to test.txt
dataset_to_text(d["test"], "test.txt")

## ___Tokenize datasets___

### Settings

Encoding with truncation is the default.

In [None]:
special_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "<S>", "<T>"]

# training the tokenizer on the training set
files = ["train.txt"]

# 30,522 vocab is BERT's default vocab size, feel free to tweak
vocab_size = 30_522

# maximum sequence length; lowering it speeds up training (when the batch size is increased accordingly)
max_length = 512

### Train and save tokenizer

Also see [this code](https://github.com/huggingface/tokenizers/blob/main/bindings/python/examples/train_bert_wordpiece.py)

In [None]:
# initialize the WordPiece tokenizer
tokenizer = BertWordPieceTokenizer()

# train the tokenizer
tokenizer.train(files=files, vocab_size=vocab_size, special_tokens=special_tokens)

# enable truncation up to the maximum 512 tokens
tokenizer.enable_truncation(max_length=max_length)
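
To make the role of the trained vocabulary more concrete, here is a minimal sketch of how WordPiece splits a single word at encoding time: greedy longest-match-first, with continuation pieces prefixed by `##`. The function and the toy vocabulary are hypothetical and only illustrate the idea; the real tokenizer also handles normalization, lower-casing, and punctuation.

```python
def wordpiece(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits, so the whole word maps to the unknown token
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (hypothetical), not the one trained above
toy_vocab = {"chloro", "##phyll", "##plast", "plant", "##s"}
for w in ["chlorophyll", "chloroplast", "plants", "xylem"]:
    print(w, wordpiece(w, toy_vocab))
```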






In [None]:
# save the tokenizer  
tokenizer.save_model(str(model_dir))

['/home/shius/projects/plantbert/models/vocab.txt']

In [None]:
# dumping some of the tokenizer config to config file, 
# including special tokens, whether to lower case and the maximum sequence length
with open(os.path.join(model_dir, "config.json"), "w") as f:
  tokenizer_cfg = {
      "do_lower_case": True,
      "unk_token": "[UNK]",
      "sep_token": "[SEP]",
      "pad_token": "[PAD]",
      "cls_token": "[CLS]",
      "mask_token": "[MASK]",
      "model_max_length": max_length,
      "max_len": max_length,
  }
  json.dump(tokenizer_cfg, f)

### Tokenize dataset

In [16]:
# when the tokenizer is trained and configured, load it as BertTokenizerFast
btz_tokenizer = BertTokenizerFast.from_pretrained(model_dir)

In [17]:
def encode(examples):
  """Mapping function to tokenize the sentences passed with truncation"""
  return btz_tokenizer(examples["text"], 
                   truncation=True, 
                   padding="max_length", 
                   max_length=max_length, 
                   return_special_tokens_mask=True)
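
As a rough picture of what `truncation=True` and `padding="max_length"` produce for one example, the sketch below builds fixed-length `input_ids` and the matching `attention_mask` by hand. The helper name and the token ids (`[PAD]`=0, `[CLS]`=2, `[SEP]`=3, following the order of `special_tokens` above) are assumptions for illustration only.

```python
def truncate_and_pad(token_ids, max_length, cls_id=2, sep_id=3, pad_id=0):
    """Wrap token ids in [CLS]/[SEP], truncate to max_length, then pad."""
    # Reserve two positions for the [CLS] and [SEP] special tokens.
    body = token_ids[: max_length - 2]
    input_ids = [cls_id] + body + [sep_id]
    attention_mask = [1] * len(input_ids)   # 1 = real token
    n_pad = max_length - len(input_ids)
    # Padding positions get the pad id and are masked out of attention (0).
    return input_ids + [pad_id] * n_pad, attention_mask + [0] * n_pad


short_ids, short_mask = truncate_and_pad([10, 11, 12], max_length=8)
long_ids, long_mask = truncate_and_pad(list(range(10, 20)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every example ends up with exactly `max_length` positions, which is why lowering `max_length` reduces memory use per batch.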

In [19]:
# tokenizing the train dataset
train_dataset = d["train"].map(encode, batched=True)

# tokenizing the testing dataset
test_dataset = d["test"].map(encode, batched=True)

Map:   0%|          | 0/379492 [00:00<?, ? examples/s]

Map:   0%|          | 0/42166 [00:00<?, ? examples/s]

## ___Retrain BERT___

### Set format

In [20]:
train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
test_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])

train_dataset, test_dataset

(Dataset({
     features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask', 'special_tokens_mask'],
     num_rows: 379492
 }),
 Dataset({
     features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask', 'special_tokens_mask'],
     num_rows: 42166
 }))

### Setup before retraining

Start out with BertForMaskedLM
- Initialize BERT model with some config set
- Initialize data collator to mask tokens
- Set training arguments
- Initialize trainer

In [25]:
# initialize the model with the config
model_config = BertConfig(vocab_size=vocab_size, 
                          max_position_embeddings=max_length)

model = BertForMaskedLM(config=model_config)
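
The size of this model can be worked out by hand. The sketch below assumes the `BertConfig` defaults that are not overridden here (hidden size 768, 12 layers, intermediate size 3072, 2 token types), a masked-LM head whose decoder weights are tied to the word embeddings, and no pooler; `bert_mlm_param_count` is a hypothetical helper.

```python
def bert_mlm_param_count(vocab_size=30_522, max_positions=512, hidden=768,
                         layers=12, intermediate=3072, type_vocab=2):
    """Count trainable parameters of a BERT masked-LM under the stated assumptions."""
    layer_norm = 2 * hidden  # weight + bias
    # word, position and token-type embeddings, followed by a LayerNorm
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # query, key, value and output projections (weights + biases), plus LayerNorm
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # two feed-forward projections (weights + biases), plus LayerNorm
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden) + layer_norm)
    encoder = layers * (attention + feed_forward)
    # MLM head: dense transform, LayerNorm, and a per-token output bias
    # (the output projection reuses the word embedding matrix)
    mlm_head = (hidden * hidden + hidden) + layer_norm + vocab_size
    return embeddings + encoder + mlm_head


print(f"{bert_mlm_param_count():,}")
```

With these assumptions the count agrees with the "Number of trainable parameters" value reported in the training log notes later in this notebook.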

In [28]:
# initialize the data collator, randomly masking 20% (default is 15%) of the 
# tokens for the Masked Language Modeling (MLM) task
# note: pass the BertTokenizerFast (btz_tokenizer), not the BertWordPieceTokenizer
data_collator = DataCollatorForLanguageModeling(
    tokenizer=btz_tokenizer, mlm=True, mlm_probability=0.2
)
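
For intuition, the masking the collator applies can be sketched in plain Python. This is a simplified illustration, assuming the standard BERT recipe: each non-special token is selected with probability `mlm_probability`; of the selected tokens, 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged; unselected positions get the label -100 so the loss ignores them. The function name and token ids are hypothetical.

```python
import random


def mask_tokens(input_ids, special_ids, mask_id, vocab_size,
                mlm_probability=0.2, rng=None):
    """Return (masked_ids, labels) for one sequence of token ids."""
    rng = rng or random.Random()
    masked, labels = list(input_ids), []
    for i, tok in enumerate(input_ids):
        # Special tokens are never selected; others are selected at random.
        if tok in special_ids or rng.random() >= mlm_probability:
            labels.append(-100)   # ignored by the loss
            continue
        labels.append(tok)        # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id                     # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)   # 10%: random token
        # remaining 10%: leave the token unchanged
    return masked, labels


# Hypothetical ids: [PAD]=0, [CLS]=2, [SEP]=3, [MASK]=4
ids = [2, 10, 11, 12, 13, 14, 3, 0, 0]
masked, labels = mask_tokens(ids, special_ids={0, 2, 3}, mask_id=4,
                             vocab_size=30_522, rng=random.Random(20231002))
print(masked)
print(labels)
```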

In [37]:
# SHS: per-device train batch sizes that ran out of CUDA memory on the 24 GiB GPU
#   30: CUDA out of memory. Tried to allocate 1.75 GiB (GPU 0; 24.00 GiB total capacity; 22.44 GiB already allocated
#   25: OutOfMemoryError: CUDA out of memory. Tried to allocate 300.00 MiB (GPU 0; 24.00 GiB total capacity; 22.89 GiB already allocated

training_args = TrainingArguments(
    output_dir=model_dir,           # directory where model checkpoints are saved
    evaluation_strategy="steps",    # evaluate every `logging_steps` steps
    overwrite_output_dir=True,      
    num_train_epochs=10,            # number of training epochs, feel free to tweak
    per_device_train_batch_size=20, # training batch size, depends on GPU memory
    gradient_accumulation_steps=8,  # accumulate gradients over 8 batches before each weight update
    per_device_eval_batch_size=64,  # evaluation batch size
    logging_steps=500,              # evaluate, log and save every 500 steps
    save_steps=500,
    save_safetensors=True,          # save weights in the safetensors format
    # load_best_model_at_end=True,  # best in terms of loss
    # save_total_limit=3,           # keep only 3 checkpoints to save space
)

In [38]:
# initialize the trainer and pass everything to it
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

### Train model

In [39]:
# train the model
trainer.train()

| Step | Training Loss | Validation Loss |
|---|---|---|
| 500 | 6.9455 | 6.516377 |
| 1000 | 6.4374 | 6.288446 |
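
One way to read these losses is as perplexity. The sketch below assumes the reported validation loss is the mean cross-entropy (in nats) over the masked tokens, so perplexity is simply `exp(loss)`; the dictionary just copies the two validation losses logged above.

```python
import math

# Validation losses copied from the training log above (step -> loss)
eval_losses = {500: 6.516377, 1000: 6.288446}

# Perplexity = exp(mean cross-entropy), assuming the loss is in nats
perplexities = {step: math.exp(loss) for step, loss in eval_losses.items()}

for step, ppl in perplexities.items():
    print(f"step {step}: perplexity ~ {ppl:.1f}")
```

A perplexity in the hundreds this early in training means the model is still close to guessing among several hundred equally likely tokens at each masked position.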


In [None]:
# to load from a saved checkpoint instead:
# model = BertForMaskedLM.from_pretrained(os.path.join(model_dir, "checkpoint-10000"))
# btz_tokenizer = BertTokenizerFast.from_pretrained(model_dir)
# use the BertTokenizerFast loaded earlier; the pipeline expects a transformers tokenizer
fill_mask = pipeline("fill-mask", model=model, tokenizer=btz_tokenizer)

In [None]:
# perform predictions
example = "It is known that [MASK] is the capital of Germany"
for prediction in fill_mask(example):
  print(prediction)

## ___Testing___

### Examine the dataset

In [None]:
len(dataset), dataset._format_columns

In [None]:
diter = dataset.iter(batch_size=5)
dnext = next(diter)
type(diter), type(dnext), dnext.keys()

In [None]:
for key in dnext:
  print(f"{key}: length={len(dnext[key])}, type={type(dnext[key])}")
  for item in dnext[key]:
    if isinstance(item, str): 
      print(f"  {item[:20]}")
    else:
      print(f"  {item}")

### Using custom dataset

In [None]:
# if you have your custom dataset 
# dataset = LineByLineTextDataset(
#     tokenizer=tokenizer,
#     file_path="path/to/data.txt",
#     block_size=64,
# )

In [None]:
# or if you have a huge custom dataset separated into several files
# load the split files
# files = ["train1.txt", "train2.txt"] # train3.txt, etc.
# dataset = load_dataset("text", data_files=files, split="train")

### Reduce dataset size

Original data:
- The original dataset has 708241 entries; after the train/test split, the training setup is:
  - Num examples = 637,416
  - Num Epochs = 10
  - Instantaneous batch size per device = 20
  - Total train batch size (w. parallel, distributed & accumulation) = 160
  - Gradient Accumulation steps = 8
  - Total optimization steps = 39,830
  - Number of trainable parameters = 109,514,298
- Killed the process after this amount of progress:
  - 16/39830 25:58 < 1231:01:01, 0.01 it/s, Epoch 0.00/10
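  - The calculation below gives about 45 days (~1,079 hours), in line with the 1,231-hour estimate above. Since 39,830 is the total number of optimization steps over all 10 epochs, this is the total training time, i.e. about 4.5 days per epoch.
- So reduce the dataset to ~1% of the original (7,000 entries), just to see what the training requirements are.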

1% cc_news data
- Now has 7000 entries
  - Num examples = 6,300
  - Num Epochs = 10
  - Instantaneous batch size per device = 25
  - Total train batch size (w. parallel, distributed & accumulation) = 200
  - Gradient Accumulation steps = 8
  - Total optimization steps = 310
  - Number of trainable parameters = 109,514,298
- Killed the process:
  - 6/310 07:50 < 9:55:52, 0.01 it/s, Epoch 0.16/10
  - So training would still need about 10 hours!! Man...
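
The "Total optimization steps" values in the two lists above can be reproduced from the batch settings. This is a sketch under assumptions: a single GPU, and a step count computed as batches per epoch (rounded up) integer-divided by the gradient accumulation steps, times the number of epochs. `total_optimization_steps` is a hypothetical helper, not a `transformers` function.

```python
import math


def total_optimization_steps(n_examples, per_device_batch, grad_accum, epochs):
    """Estimate the number of weight updates for a single-GPU run."""
    # Number of batches the dataloader yields per epoch (last batch may be partial)
    batches_per_epoch = math.ceil(n_examples / per_device_batch)
    # One weight update every `grad_accum` batches; leftover batches are dropped
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs


full_run = total_optimization_steps(637_416, per_device_batch=20, grad_accum=8, epochs=10)
small_run = total_optimization_steps(6_300, per_device_batch=25, grad_accum=8, epochs=10)
print(full_run, small_run)
```

The effective batch size per update is `per_device_batch * grad_accum`, which is where the 160 and 200 in the logs come from.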

In [None]:
# 16 steps took ~26 min; 39,830 is the total number of optimization steps (all 10 epochs)
sec_total = 39830/16*(26*60)
hrs  = sec_total/60/60
days = hrs/24
sec_total, hrs, days

In [None]:
# Load dataset from disk
#https://huggingface.co/docs/datasets/v1.2.1/loading_datasets.html
#https://discuss.huggingface.co/t/load-dataset-from-arrow-file/24999/2

# Process, split/splice dataset
#https://huggingface.co/docs/datasets/v1.4.1/splits.html
#https://huggingface.co/docs/datasets/v1.4.1/processing.html

dataset_small = dataset.select(range(7000))
len(dataset_small)