# Pre-training SmallBERTa - A tiny model to train on a tiny dataset
(Using HuggingFace Transformers)<br>
Admittedly, while language modeling is associated with terabytes of data, not all of use have either the processing power nor the resources to train huge models on such huge amounts of data.
In this example, we are going to train a relatively small neural net on a small dataset (which still happens to have over 2M rows).
<br>

The ***main purpose*** of this blog is not to achieve state-of-the-art performance on LM tasks but to show a simple idea of how the recent language_modeling.py script can be used to train a Transformer model from scratch.

This very notebook can be extended to various esoteric use cases where general purpose pre-trained models fail to perform well. Examples include medical dataset, scientific literature, legal documentation, etc.

Input:
  1. To the Tokenizer:<br>
      LM data in a directory containing all samples in separate *.txt files.
  
  2. To the Model:<br>
      LM data split into:<br>
        1. train.txt <br>
        2. eval.txt 
        
Output:<br>
  Trained Model weights(that can be used elsewhere) and Tensorboard logs

## Install Dependencies

In [172]:
#tokenizer working version --- 0.5.0
#transformer working version --- 2.5.0
!pip install transformers
!pip install tokenizers
!pip install tensorboard==2.1.0



## Fetch Data
We will be using a tiny dataset(The Examiner - SpamClickBait News) of around 3M rows from kaggle to train our model. The dataset also contains output labels which will be dropped and only the text shall be used. For convenience we are using the Kaggle API to direcltly download the data from Kaggle to save our time and efforts. 

In [0]:
import os
import getpass

#For a kaggle username & key, just go to your kaggle account and generate key
#The JSON file so downloaded contains both of them
if("examine-the-examiner.zip" not in os.listdir()):
  print("Copy these two values from the JSON file so generated")
  os.environ['KAGGLE_USERNAME'] = getpass.getpass(prompt='Kaggle username: ') 
  os.environ['KAGGLE_KEY'] =  getpass.getpass(prompt='Kaggle key: ')
  !kaggle datasets download -d therohk/examine-the-examiner
  !unzip /content/examine-the-examiner.zip

Copy these two values from the JSON file so generated
Kaggle username: ··········
Kaggle key: ··········
Downloading examine-the-examiner.zip to /content
 86% 123M/142M [00:00<00:00, 132MB/s]
100% 142M/142M [00:00<00:00, 163MB/s]
Archive:  /content/examine-the-examiner.zip
  inflating: examiner-date-text.csv  
  inflating: examiner-date-tokens.csv  


## Load and Preprocess data

In [0]:
import regex as re
def basicPreprocess(text):
  try:
    processed_text = text.lower()
    processed_text = re.sub(r'\W +', ' ', processed_text)
  except Exception as e:
    print("Exception:",e,",on text:", text)
    return None
  return processed_text

In [0]:
import pandas as pd
from tqdm import tqdm

In [175]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Read and Prune the data
For our purpose we are going to read a subset (~200,000 samples) to train, just to see results quickly. Feel free to increase (or remove) this limitation.  

In [183]:
data = pd.read_csv("/content/drive/My Drive/examiner-date-text.csv")
print(data)

         publish_date                                      headline_text
0            20100101       100 Most Anticipated books releasing in 2010
1            20100101       10 best films of 2009 - What's on your list?
2            20100101  10 days of free admission at Lan Su Chinese Ga...
3            20100101      10 PlayStation games to watch out for in 2010
4            20100101  10 resolutions for a Happy New Year for you an...
...               ...                                                ...
3089776      20151231  Which is better investment, Lego bricks or gol...
3089777      20151231  Wild score three unanswered goals to defeat th...
3089778      20151231  With NASA and Russia on the sidelines, Europe ...
3089779      20151231  Wolf Pack battling opponents, officials on the...
3089780      20151231          Writespace hosts all genre open mic night

[3089781 rows x 2 columns]


In [0]:
data = data.sample(frac=1).sample(frac=1)
data = data[:5000]

### Before Preprocessing 

In [185]:
print(data)

         publish_date                                      headline_text
2404600      20130531  ‘Pokemon X’ and ‘Pokemon Y’ EVs attribute info...
1029672      20110207    SFA's 'A Cappella Choir' Presents Concert Today
1080823      20110302  Nintendo President reveals 3DS Super Mario at ...
968212       20110111                 The Whisky a go-go opens its doors
2653615      20140125                Maryland's media: can it be trusted
...               ...                                                ...
1401063      20110829  The New York Mets launched Anti-Bullying Campa...
1268249      20110607                   THIN AIR features Peter Robinson
2609929      20131211  Free ‘GTA V’ Online deathmatch & race creators...
1690518      20120218                         Ubu Rex and the ridiculous
1977367      20120814  Gamescom 2012: Battlefield 3 Armored Kill scre...

[5000 rows x 2 columns]


In [0]:
data["headline_text"] = data["headline_text"].apply(basicPreprocess).dropna() #ignore exception if for empty/nan values

### After Preprocessing

In [187]:
print(data)

         publish_date                                      headline_text
2404600      20130531  ‘pokemon x and ‘pokemon y evs attribute inform...
1029672      20110207     sfa's 'a cappella choir presents concert today
1080823      20110302  nintendo president reveals 3ds super mario at ...
968212       20110111                 the whisky a go-go opens its doors
2653615      20140125                 maryland's media can it be trusted
...               ...                                                ...
1401063      20110829  the new york mets launched anti-bullying campa...
1268249      20110607                   thin air features peter robinson
2609929      20131211  free ‘gta v online deathmatch  race creators u...
1690518      20120218                         ubu rex and the ridiculous
1977367      20120814  gamescom 2012 battlefield 3 armored kill scree...

[5000 rows x 2 columns]


Removing newline characters just in case the input text has them. This is because the LineByLine class that we are going to use later assumes that samples are separated by newline

In [188]:
data = data[['headline_text']]
print(data)

                                             headline_text
2404600  ‘pokemon x and ‘pokemon y evs attribute inform...
1029672     sfa's 'a cappella choir presents concert today
1080823  nintendo president reveals 3ds super mario at ...
968212                  the whisky a go-go opens its doors
2653615                 maryland's media can it be trusted
...                                                    ...
1401063  the new york mets launched anti-bullying campa...
1268249                   thin air features peter robinson
2609929  free ‘gta v online deathmatch  race creators u...
1690518                         ubu rex and the ridiculous
1977367  gamescom 2012 battlefield 3 armored kill scree...

[5000 rows x 1 columns]


In [0]:
# data.to_csv('data.txt',index=False,header=False, quoting=csv.QUOTE_NONE)

In [194]:
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer, BertWordPieceTokenizer

paths = [str(x) for x in Path("/content").glob("**/*.txt")]

# Initialize a tokenizer
tokenizer = BertWordPieceTokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "[UNK]",
    "[SEP]",
    "[CLS]",
    "[MASK]",
    "[PAD]",
])

# Save files to disk
tokenizer.save(".", "smallbert")

['./smallbert-vocab.txt']

In [0]:
!mv /content/smallbert-vocab.txt /content/vocab.txt

In [207]:
tokenizer = BertWordPieceTokenizer("/content/vocab.txt")
tokenizer._tokenizer.post_processor = BertProcessing(
    ("[SEP]", tokenizer.token_to_id("[CLS]")),
    ("[CLS]", tokenizer.token_to_id("[SEP]")),
)
tokenizer.enable_truncation(max_length=512)

print(
    tokenizer.encode("the whisky a go-go opens its doors").tokens
)

['[CLS]', 'the', 'whisky', 'a', 'go', '-', 'go', 'opens', 'its', 'doors', '[SEP]']


## Train a custom tokenizer
I have used a ByteLevelBPETokenizer just to prevent \<unk> tokens entirely.
Furthermore, the function used to train the tokenizer assumes that each sample is stored in a different text file.

In [0]:
txt_files_dir = "/tmp/text_split"
!mkdir {txt_files_dir}

Split LM data into individual files. These files are stored in /tmp/text_split and are used to train the tokenizer **only**.

In [0]:
i=0
for row in tqdm(data.to_list()):
  file_name = os.path.join(txt_files_dir, str(i)+'.txt')
  try:
    f = open(file_name, 'w')
    f.write(row)
    f.close()
  except Exception as e:  #catch exceptions(for eg. empty rows)
    print(row, e) 
  i+=1

100%|██████████| 200000/200000 [00:09<00:00, 20693.63it/s]


In [0]:
lm_data_dir = "/tmp/lm_data"
!mkdir {lm_data_dir}

## Split into Valdation and Train set
We split the train data into validation and train. These two files are used to train and evaluate our model

In [0]:
train_split = 0.9
train_data_size = int(len(data)*train_split)

with open(os.path.join(lm_data_dir,'train.txt') , 'w') as f:
    for item in data[:train_data_size].tolist():
        f.write("%s\n" % item)

with open(os.path.join(lm_data_dir,'eval.txt') , 'w') as f:
    for item in data[train_data_size:].tolist():
        f.write("%s\n" % item)

In [0]:
!mkdir /content/models
!mkdir /content/models/smallBERTa

In [198]:
tokenizer.save("/content/models/smallBERTa", "smallBERTa")

['/content/models/smallBERTa/smallBERTa-vocab.json',
 '/content/models/smallBERTa/smallBERTa-merges.txt']

In [0]:
!mv /content/models/smallBERTa/smallBERTa-vocab.json /content/models/smallBERTa/vocab.json
!mv /content/models/smallBERTa/smallBERTa-merges.txt /content/models/smallBERTa/merges.txt

In [0]:
train_path = os.path.join(lm_data_dir,"train.txt")
eval_path = os.path.join(lm_data_dir,"eval.txt")

## Set Model Configuration
For our purpose, we are training a very small model for demo purposes

In [0]:
import json
config = {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.3,
  "hidden_size": 128,
  "initializer_range": 0.02,
  "num_attention_heads": 1,
  "num_hidden_layers": 1,
  "vocab_size": vocab_size,
  "intermediate_size": 256,
  "max_position_embeddings": 256
}
with open("/content/models/smallBert/config.json", 'w') as fp:
    json.dump(config, fp)

In [0]:
#%cd /content
!git clone https://github.com/huggingface/transformers.git

Cloning into 'transformers'...
remote: Enumerating objects: 24, done.[K
remote: Counting objects: 100% (24/24), done.[K
remote: Compressing objects: 100% (23/23), done.[K
remote: Total 19858 (delta 5), reused 6 (delta 0), pack-reused 19834[K
Receiving objects: 100% (19858/19858), 11.95 MiB | 4.05 MiB/s, done.
Resolving deltas: 100% (14423/14423), done.


## Run training using the run_language_modeling.py examples script

In [0]:
!nvidia-smi #just to confirm that you are on a GPU, if not go to Runtime->Change Runtime

Fri Feb 21 12:17:21 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.48.02    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P8     7W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

In [214]:
#Setting environment variables
os.environ["train_path"] = train_path
os.environ["eval_path"] = eval_path
os.environ["CUDA_LAUNCH_BLOCKING"]='1'  #Makes for easier debugging (just in case)
weights_dir = "/content/models/smallBert/weights"
!mkdir {weights_dir}

mkdir: cannot create directory ‘/content/models/smallBert/weights’: File exists


In [215]:
train_path

'/tmp/lm_data/train.txt'

In [0]:
!mkdir /content/models/smallBert

In [0]:
!mv /content/vocab.txt /content/models/smallBert

In [0]:
cmd = '''python /content/transformers/examples/run_language_modeling.py --output_dir {0}  \
    --model_type bert \
    --mlm \
    --train_data_file {1} \
    --eval_data_file {2} \
    --config_name /content/models/smallBert \
    --tokenizer_name /content/models/smallBert \
    --do_train \
    --line_by_line \
    --overwrite_output_dir \
    --do_eval \
    --block_size 256 \
    --learning_rate 1e-4 \
    --num_train_epochs 5 \
    --save_total_limit 2 \
    --save_steps 2000 \
    --logging_steps 500 \
    --per_gpu_eval_batch_size 32 \
    --per_gpu_train_batch_size 32 \
    --evaluate_during_training \
    --seed 42 \
    '''.format(weights_dir, train_path, eval_path)

In [217]:
!{cmd}

02/27/2020 11:15:58 - INFO - transformers.configuration_utils -   loading configuration file /content/models/smallBert/config.json
02/27/2020 11:15:58 - INFO - transformers.configuration_utils -   Model config BertConfig {
  "architectures": null,
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": null,
  "do_sample": false,
  "eos_token_ids": null,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.3,
  "hidden_size": 128,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "initializer_range": 0.02,
  "intermediate_size": 256,
  "is_decoder": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "layer_norm_eps": 1e-12,
  "length_penalty": 1.0,
  "max_length": 20,
  "max_position_embeddings": 256,
  "model_type": "bert",
  "num_attention_heads": 1,
  "num_beams": 1,
  "num_hidden_layers": 1,
  "num_labels": 2,
  "num_return_sequences": 1,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_past": true,
  "pad_

In [0]:
from transformers import BertModel

bert = BertModel.from_pretrained('/content/models/smallBert/model/')

tokenizer = BertWordPieceTokenizer("/content/models/smallBert/vocab.txt")


In [0]:
!mkdir /content/models/smallBert/model

In [0]:
!cp /content/models/smallBert/weights/config.json /content/models/smallBert/model
!cp /content/models/smallBert/weights/pytorch_model.bin /content/models/smallBert/model
!cp /content/models/smallBert/weights/vocab.txt /content/models/smallBert/model

In [0]:
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model=bert,
    tokenizer=tokenizer
)

In [226]:
!pip install --upgrade transformers

Requirement already up-to-date: transformers in /usr/local/lib/python3.6/dist-packages (2.5.1)


In [243]:
text = "the whisky a go-go opens its door"
tokenized_text = tokenizer.encode(text).tokens
tokenized_text

['[CLS]',
 'the',
 'whisky',
 'a',
 'go',
 '-',
 'go',
 'opens',
 'its',
 'door',
 '[SEP]']

In [244]:
masked_index = 9
tokenized_text[masked_index] = '[MASK]'
tokenized_text

['[CLS]',
 'the',
 'whisky',
 'a',
 'go',
 '-',
 'go',
 'opens',
 'its',
 '[MASK]',
 '[SEP]']

In [245]:
segments_ids = tokenizer.encode(text).type_ids
indexed_tokens = tokenizer.encode(text).ids
indexed_tokens 

[2, 1183, 10535, 43, 1459, 17, 1459, 6665, 2769, 3914, 1]

In [0]:
import torch

tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

In [0]:
from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained('/content/models/smallBert/model/')

In [0]:
# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    predictions = outputs[0]

In [250]:
predictions

tensor([[[-0.1060, -0.1224,  0.3537,  ..., -0.1021,  0.0311, -0.5150],
         [-0.0526, -0.0738,  0.3816,  ..., -0.0949, -0.2198, -0.4439],
         [ 0.0329,  0.0447,  0.3771,  ...,  0.0593,  0.0518, -0.5494],
         ...,
         [ 0.0656, -0.1294,  0.0816,  ..., -0.0535, -0.0254, -0.3831],
         [-0.0361, -0.2877,  0.4497,  ..., -0.0881,  0.0264, -0.6028],
         [-0.0789, -0.0009,  0.1128,  ...,  0.0317, -0.1268, -0.5132]]])

In [252]:
# confirm we were able to predict 'henson'
predicted_index = torch.argmax(predictions[0, masked_index]).item()

predicted_index

# predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]


4439

In [258]:
tokenizer.id_to_token(4439)

'common'

## View Results on Tensorboard

In [0]:
!tensorboard dev upload --logdir /content/runs


***** TensorBoard Uploader *****

This will upload your TensorBoard logs to https://tensorboard.dev/ from
the following directory:

/content/runs

This TensorBoard will be visible to everyone. Do not upload sensitive
data.

Your use of this service is subject to Google's Terms of Service
<https://policies.google.com/terms> and Privacy Policy
<https://policies.google.com/privacy>, and TensorBoard.dev's Terms of Service
<https://tensorboard.dev/policy/terms/>.

This notice will not be shown again while you are logged into the uploader.
To log out, run `tensorboard dev auth revoke`.

Continue? (yes/NO) yes

Please visit this URL to authorize this application: https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=373649185512-8v619h5kft38l4456nm2dj4ubeqsrvh6.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email&state=kgAdxJj3xxL6gDgTUoUWbPVrkXeIzl&prompt=consent&access_type=offline