# NLP | How to add a domain-specific vocabulary (new tokens) to a subword tokenizer already trained like BERT WordPiece

* **Author**: [Pierre GUILLOU](https://www.linkedin.com/in/pierreguillou/)
* **Date**: April 05, 2021
* **Blog post**: [NLP & domain specific | How to add a specialized vocabulary (new tokens) to a subword tokenizer already trained like BERT WordPiece](https://medium.com/@pierre_guillou/nlp-how-to-add-a-domain-specific-vocabulary-new-tokens-to-a-subword-tokenizer-already-trained-33ab15613a41)

**Summary**: In some cases, it may be crucial to enrich the vocabulary of an already trained natural language model with vocabulary from a specialized domain (medicine, law, etc.) in order to perform new tasks (classification, NER, summarization, translation, etc.). While the Hugging Face library allows you to easily add new tokens to the vocabulary of an existing tokenizer like BERT WordPiece, those tokens must be whole words, not subwords. This article explains why and how to obtain these new tokens from a specialized corpus.

## Install libraries

In [1]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0


In [2]:
# Install the latest Hugging Face libraries (datasets & transformers)
!pip install datasets git+https://github.com/huggingface/transformers/
# install spaCy
!pip install -U pip setuptools wheel
!pip install -U spacy[cuda110]
!python -m spacy download en_core_web_sm
# install scikit-learn
!pip install -U scikit-learn
# install matplotlib
!pip install matplotlib
# install wikipedia
!pip install wikipedia


## Download a BERT model and its WordPiece tokenizer

In [6]:
from transformers import AutoModelForMaskedLM, AutoTokenizer
model_name = "bert-base-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForMaskedLM.from_pretrained(model_name)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Tokenize a phrase about COVID

In [7]:
text = "COVID-19 affects different people in different ways. Most infected people will develop mild to moderate illness and recover without hospitalization."

In [8]:
# tokenization of the text
tokens = tokenizer.tokenize(text)
print(tokens)

['CO', '##VI', '##D', '-', '19', 'affects', 'different', 'people', 'in', 'different', 'ways', '.', 'Most', 'infected', 'people', 'will', 'develop', 'mild', 'to', 'moderate', 'illness', 'and', 'recover', 'without', 'hospital', '##ization', '.']


In [9]:
# back to text
tokenizer.decode(tokenizer.encode(text), skip_special_tokens=True)

'COVID - 19 affects different people in different ways. Most infected people will develop mild to moderate illness and recover without hospitalization.'

In [10]:
print(tokenizer.tokenize('COVID'))
print(tokenizer.tokenize('hospitalization'))

['CO', '##VI', '##D']
['hospital', '##ization']


**We can see that the BERT WordPiece tokenizer (from the bert-base-cased model) tokenizes the words COVID and hospitalization with subwords because they do not exist as whole words in the tokenizer vocabulary.**

In [11]:
# Verify that the words COVID and hospitalization DO NOT belong to the tokenizer vocabulary
vocab = [tok for tok,index in tokenizer.get_vocab().items()]
"COVID" in vocab, "hospitalization" in vocab

(False, False)
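
To make the splitting above more concrete, here is a minimal sketch of the greedy longest-match-first rule that WordPiece uses inside a single word. The function `wordpiece_tokenize` and the tiny `toy_vocab` are assumptions made for illustration; this is not the Hugging Face implementation and not the real bert-base-cased vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into WordPiece tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # try the longest remaining substring first, then shrink it from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# toy vocabulary (an assumption for illustration, not the real BERT vocabulary)
toy_vocab = {"CO", "##VI", "##D", "hospital", "##ization", "people"}

print(wordpiece_tokenize("COVID", toy_vocab))              # ['CO', '##VI', '##D']
print(wordpiece_tokenize("hospitalization", toy_vocab))    # ['hospital', '##ization']
print(wordpiece_tokenize("COVID", toy_vocab | {"COVID"}))  # ['COVID']
```

As soon as the whole word is in the vocabulary, the longest match is the word itself and no subword is needed.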

## [ First test ] Add 2 new tokens (whole words) into the tokenizer vocab

In [12]:
new_tokens = ['COVID', 'hospitalization']

In [13]:
print("[ BEFORE ] tokenizer vocab size:", len(tokenizer)) 
added_tokens = tokenizer.add_tokens(new_tokens)

print("[ AFTER ] tokenizer vocab size:", len(tokenizer)) 
print()
print('added_tokens:',added_tokens)
print()

# resize the embeddings matrix of the model 
model.resize_token_embeddings(len(tokenizer)) 

[ BEFORE ] tokenizer vocab size: 28996
[ AFTER ] tokenizer vocab size: 28998

added_tokens: 2



Embedding(28998, 768)

In [14]:
# Verify that the words COVID and hospitalization DO belong to the tokenizer vocabulary
vocab = [tok for tok,index in tokenizer.get_vocab().items()]
"COVID" in vocab, "hospitalization" in vocab

(True, True)

Let's call tokenizer_exBERT our tokenizer with the 2 new tokens.

In [15]:
tokenizer_exBERT = tokenizer

In [16]:
# tokenization of the text
tokens = tokenizer_exBERT.tokenize(text)
print(tokens)

['COVID', '-', '19', 'affects', 'different', 'people', 'in', 'different', 'ways', '.', 'Most', 'infected', 'people', 'will', 'develop', 'mild', 'to', 'moderate', 'illness', 'and', 'recover', 'without', 'hospitalization', '.']


In [17]:
# back to text
tokenizer_exBERT.decode(tokenizer_exBERT.encode(text), skip_special_tokens=True)

'COVID - 19 affects different people in different ways. Most infected people will develop mild to moderate illness and recover without hospitalization.'

**The tokenizer with the 2 new tokens succeeded in tokenizing the words COVID and hospitalization without subwords, as they now belong to the tokenizer vocabulary.**

In [18]:
# tokenization of the words COVID and hospitalization
print(tokenizer_exBERT.tokenize('COVID'))
print(tokenizer_exBERT.tokenize('hospitalization'))

['COVID']
['hospitalization']


## [ Second test ] Add more new tokens (subwords and words) into the tokenizer vocab

What if we want to detect the whole vocabulary of a specialized corpus (and not only 2 words) in order to add it to an existing tokenizer vocabulary?

Let's use a WordPiece tokenizer for this! (Why a WordPiece tokenizer? This is our first guess: since the BERT tokenizer is a WordPiece tokenizer, let's use a tokenizer of the same type.)

### 1) Import pages about COVID from English Wikipedia

In [19]:
import wikipedia

# let's choose 2 Wikipedia pages for our demonstration (we could have chosen many more)
pages = ["COVID-19","COVID-19 pandemic"]

documents = list()
for p in pages:
  page = wikipedia.page(p)
  documents.append(page.content)
  print(page.title,page.url)

COVID-19 https://en.wikipedia.org/wiki/COVID-19
COVID-19 pandemic https://en.wikipedia.org/wiki/COVID-19_pandemic


### 2) Train a WordPiece tokenizer on the imported Wikipedia pages

Source: [All together: a BERT tokenizer from scratch](https://huggingface.co/docs/tokenizers/python/latest/pipeline.html#all-together-a-bert-tokenizer-from-scratch)

In [20]:
# WordPiece tokenizer
from tokenizers import Tokenizer
from tokenizers.models import WordPiece

bert_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# normalizer
from tokenizers import normalizers
from tokenizers.normalizers import Lowercase, NFD, StripAccents

bert_tokenizer.normalizer = normalizers.Sequence([NFD()])

# pre-tokenizer
from tokenizers.pre_tokenizers import Whitespace

bert_tokenizer.pre_tokenizer = Whitespace()

# template
from tokenizers.processors import TemplateProcessing

bert_tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", 1),
        ("[SEP]", 2),
    ],
)

# instantiate a trainer
from tokenizers.trainers import WordPieceTrainer
trainer = WordPieceTrainer(
    vocab_size=30522, 
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
    )

# Train 
files = documents
bert_tokenizer.train_from_iterator(files, trainer)

### 3) Get the vocabulary that is not in the original BERT tokenizer

This step is not necessary, as the `tokenizer.add_tokens()` method will add new tokens only if they do not belong to the existing tokenizer vocabulary. However, it helps us to see what these new tokens are.

In [21]:
old_vocab = [k for k,v in tokenizer.get_vocab().items()]
new_vocab = [k for k,v in bert_tokenizer.get_vocab().items()]
idx_old_vocab_list = list()
same_tokens_list = list()
different_tokens_list = list()

for idx_new, w in enumerate(new_vocab):
    try:
        idx_old = old_vocab.index(w)
    except ValueError:  # w is not in the original vocabulary
        idx_old = -1
    if idx_old >= 0:
        idx_old_vocab_list.append(idx_old)
        same_tokens_list.append((w, idx_new))
    else:
        different_tokens_list.append((w, idx_new))

In [22]:
len(same_tokens_list),len(different_tokens_list),len(same_tokens_list)+len(different_tokens_list)

(4747, 3666, 8413)

**We found 3666 tokens (subwords or words) that are not in the vocabulary of the original tokenizer.**
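
The comparison between the two vocabularies can also be written with a set lookup instead of `list.index()`, which avoids scanning the old vocabulary once per candidate token. The helper `split_vocab` and the toy lists below are hypothetical names and data used only for illustration.

```python
def split_vocab(candidates, old_vocab):
    """Split candidates into (already known, new), keeping the candidates' order."""
    known = set(old_vocab)  # O(1) membership test instead of list.index()
    same = [(tok, i) for i, tok in enumerate(candidates) if tok in known]
    different = [(tok, i) for i, tok in enumerate(candidates) if tok not in known]
    return same, different

# toy data (assumed for illustration)
old_vocab = ["people", "hospital", "##ization", "CO", "##VI", "##D"]
candidates = ["people", "COVID", "##ization", "hospitalization"]

same, different = split_vocab(candidates, old_vocab)
print(same)       # [('people', 0), ('##ization', 2)]
print(different)  # [('COVID', 1), ('hospitalization', 3)]
```

The returned pairs have the same `(token, index)` shape as the lists built in the cell above, so the rest of the notebook would not need to change.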

In [23]:
# get list of new tokens
new_tokens = [k for k,v in different_tokens_list]
len(new_tokens), new_tokens[:10]

(3666,
 ['Guilds',
  'TN',
  '##ogle',
  'Available',
  '##hion',
  '##arating',
  'val',
  '##ominantly',
  'cor',
  '##ension'])

### 4) Add the new tokens (subwords and words) in the vocabulary of the original BERT tokenizer

In [24]:
from transformers import AutoModelForMaskedLM, AutoTokenizer
model_name = "bert-base-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForMaskedLM.from_pretrained(model_name)



In [25]:
print("[ BEFORE ] tokenizer vocab size:", len(tokenizer)) 
added_tokens = tokenizer.add_tokens(new_tokens)

print("[ AFTER ] tokenizer vocab size:", len(tokenizer)) 
print()
print('added_tokens:',added_tokens)
print()

# resize the embeddings matrix of the model 
model.resize_token_embeddings(len(tokenizer)) 

[ BEFORE ] tokenizer vocab size: 28996
[ AFTER ] tokenizer vocab size: 32662

added_tokens: 3666



Embedding(32662, 768)

In [27]:
# Verify whether the words COVID and hospitalization belong to the tokenizer vocabulary
vocab = [tok for tok,index in tokenizer.get_vocab().items()]
"COVID" in vocab, "hospitalization" in vocab

(False, False)

Let's call tokenizer_exBERT our tokenizer with the new tokens.

In [28]:
tokenizer_exBERT = tokenizer

In [29]:
# tokenization of the text
tokens = tokenizer_exBERT.tokenize(text)
print(tokens)

['COV', 'ID', '-', '19', 'affec', 't', '##s', 'dif', 'fe', 'rent', 'pe', 'o', 'ple', 'in', 'dif', 'fe', 'rent', 'ways', '.', 'Mo', 'st', 'in', 'fe', 'c', '##ted', 'pe', 'o', 'ple', 'will', 'd', 'ev', 'e', 'lop', 'mil', 'd', 'to', 'mod', 'e', 'ra', 'te', 'ill', '##n', 'ess', 'and', 'rec', 'over', 'without', 'ho', 'sp', 'i', 'tal', 'i', '##zation', '.']


In [30]:
# back to text
tokenizer_exBERT.decode(tokenizer_exBERT.encode(text), skip_special_tokens=True)

'COV ID - 19 affec ts dif fe rent pe o ple in dif fe rent ways. Mo st in fe cted pe o ple will d ev e lop mil d to mod e ra te illn ess and rec over without ho sp i tal ization.'

**As the words COVID and hospitalization do not belong to the tokenizer vocabulary, they continue to be tokenized with subwords. That is expected.**

**However, neither word is well tokenized: COVID is split into `COV` and `ID`, and hospitalization into pieces that mostly do not start with ##. Yet, except for the first token of a word, all other subword tokens should start with ##!**

**And we can see that many other words in the sentence are not well tokenized either.**
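
One plausible explanation, stated here as an assumption about how added tokens are handled rather than something this notebook verifies: tokens added with `add_tokens()` are searched for in the raw text before WordPiece runs, and by default they can match anywhere, including in the middle of a word. The toy function `split_on_added_tokens` below is a hypothetical illustration of that pre-splitting step, not the Hugging Face implementation.

```python
import re

def split_on_added_tokens(text, added_tokens):
    """Cut the text at every occurrence of an added token (longest tokens first)."""
    if not added_tokens:
        return [text]
    # longest alternatives first so that a long token wins over its own prefix
    ordered = sorted(added_tokens, key=lambda t: (-len(t), t))
    pattern = "|".join(re.escape(t) for t in ordered)
    pieces = re.split(f"({pattern})", text)
    return [p for p in pieces if p]

# word starts such as 'dif', 'fe', 'rent' behave like scissors inside other words
print(split_on_added_tokens("different", {"dif", "fe", "rent"}))
# ['dif', 'fe', 'rent']
print(split_on_added_tokens("hospitalization", {"ho", "sp", "tal"}))
# ['ho', 'sp', 'i', 'tal', 'ization']
# a whole word only matches itself
print(split_on_added_tokens("hospitalization", {"hospitalization"}))
# ['hospitalization']
```

Under this assumption, each leftover chunk (such as `ization`) is then tokenized by WordPiece as if it were a separate word, which would explain why most pieces carry no ## prefix.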

In [34]:
# tokenization of the words COVID and hospitalization
print(tokenizer_exBERT.tokenize('COVID'))
print(tokenizer_exBERT.tokenize('hospitalization'))

['COV', 'ID']
['ho', 'sp', 'i', 'tal', 'i', '##zation']


### 5) Add only the new tokens that do not start with ## in the vocabulary of the original BERT tokenizer

We know that a subword is not just a token that starts with ##, but let's see what happens if we remove all those subwords from the list of new tokens.

In [31]:
# get list of new tokens as whole words
new_tokens = [tok for tok in new_tokens if tok.startswith("#") == False]
len(new_tokens), new_tokens[:10]

(2499,
 ['Guilds',
  'TN',
  'Available',
  'val',
  'cor',
  'Recommend',
  'aggravation',
  'Thorough',
  'Asians',
  'Ursul'])

In [32]:
from transformers import AutoModelForMaskedLM, AutoTokenizer
model_name = "bert-base-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForMaskedLM.from_pretrained(model_name)



In [33]:
print("[ BEFORE ] tokenizer vocab size:", len(tokenizer)) 
added_tokens = tokenizer.add_tokens(new_tokens)

print("[ AFTER ] tokenizer vocab size:", len(tokenizer)) 
print()
print('added_tokens:',added_tokens)
print()

# resize the embeddings matrix of the model 
model.resize_token_embeddings(len(tokenizer)) 

[ BEFORE ] tokenizer vocab size: 28996
[ AFTER ] tokenizer vocab size: 31495

added_tokens: 2499



Embedding(31495, 768)

Let's call tokenizer_exBERT our tokenizer with the new tokens.

In [34]:
tokenizer_exBERT = tokenizer

In [35]:
# tokenization of the text
tokens = tokenizer_exBERT.tokenize(text)
print(tokens)

['COV', 'ID', '-', '19', 'affec', 't', '##s', 'dif', 'fe', 'rent', 'pe', 'o', 'ple', 'in', 'dif', 'fe', 'rent', 'ways', '.', 'Mo', 'st', 'in', 'fe', 'c', '##ted', 'pe', 'o', 'ple', 'will', 'd', 'ev', 'e', 'lop', 'mil', 'd', 'to', 'mod', 'e', 'ra', 'te', 'ill', '##n', 'ess', 'and', 'rec', 'over', 'without', 'ho', 'sp', 'i', 'tal', 'i', '##zation', '.']


In [36]:
# back to text
tokenizer_exBERT.decode(tokenizer_exBERT.encode(text), skip_special_tokens=True)

'COV ID - 19 affec ts dif fe rent pe o ple in dif fe rent ways. Mo st in fe cted pe o ple will d ev e lop mil d to mod e ra te illn ess and rec over without ho sp i tal ization.'

**The tokenizer continues to fail!**

**It means that we must improve the new tokens list by also removing the subwords that begin a word (i.e., those that do not start with ##).**

In [37]:
# tokenization of the words COVID and hospitalization
print(tokenizer_exBERT.tokenize('COVID'))
print(tokenizer_exBERT.tokenize('hospitalization'))

['COV', 'ID']
['ho', 'sp', 'i', 'tal', 'i', '##zation']


## [ Third test ] Add new tokens (only words, not subwords) into the tokenizer vocab

Let's add only the new tokens that are whole words, not subwords (i.e., tokens that neither start with ## nor are followed by a subword starting with ##), to the vocabulary of the original BERT tokenizer.

### 1) Let's use a word tokenizer (spaCy) to find the most frequent words of our corpus by using scikit-learn

**Yes, but how?** Let's use a **word tokenizer like spaCy** to find the most frequent words of our corpus, instead of a WordPiece tokenizer, which generates subwords as well.

**Observation**: here, the expression "most frequent words" means: the tokens present in most of the documents.

In [38]:
import spacy
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer

import matplotlib.pyplot as plt

In [39]:
# initialize our tokenizer with the English spaCy one
nlp = spacy.load("en_core_web_sm", exclude=['morphologizer', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])

In [40]:
def spacy_tokenizer(document, nlp=nlp):
    # tokenize the document with spaCy
    doc = nlp(document)
    # remove stop words, punctuation symbols, empty tokens and tokens containing a line break
    tokens = [
        token.text for token in doc
        if not token.is_stop
        and not token.is_punct
        and token.text.strip() != ''
        and "\n" not in token.text
    ]
    return tokens

def dfreq(idf, N):
    return (1+N) / np.exp(idf - 1) - 1
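
The `dfreq` helper inverts the smoothed IDF formula documented for scikit-learn with `smooth_idf=True`, namely idf = ln((1 + N) / (1 + df)) + 1, where N is the number of documents and df the number of documents containing the token. The sketch below checks the round trip with the standard library only; `smooth_idf` is a hypothetical helper name, and using `round()` instead of integer truncation is a choice made here to stay safe against floating-point error.

```python
import math

def smooth_idf(df, n_docs):
    """IDF as documented for TfidfVectorizer(smooth_idf=True)."""
    return math.log((1 + n_docs) / (1 + df)) + 1

def dfreq(idf, n_docs):
    """Invert smooth_idf to recover the document frequency."""
    return (1 + n_docs) / math.exp(idf - 1) - 1

n_docs = 2
for df in (1, 2):
    recovered = dfreq(smooth_idf(df, n_docs), n_docs)
    # round() rather than int(): the float may be slightly below the true integer
    print(df, round(recovered))
```

With 2 documents, the loop prints `1 1` and `2 2`: the document frequency is recovered exactly once the result is rounded.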

In [41]:
%%time
# https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting
tfidf_vectorizer = TfidfVectorizer(lowercase=False, tokenizer=spacy_tokenizer, 
                                   norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
# parse matrix of tfidf
docs = documents
length = len(docs)
result = tfidf_vectorizer.fit_transform(docs)
# print(result.shape)

# idf
idf = tfidf_vectorizer.idf_

# sorted idf, tokens and docs frequencies
idf_sorted_indexes = sorted(range(len(idf)), key=lambda k: idf[k])
idf_sorted = idf[idf_sorted_indexes]
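# note: on scikit-learn >= 1.0, use get_feature_names_out() instead of get_feature_names()
tokens_by_df = np.array(tfidf_vectorizer.get_feature_names())[idf_sorted_indexes]
# round before casting so that floating-point error cannot truncate a frequency (e.g. 0.9999 -> 0)
dfreqs_sorted = np.rint(dfreq(idf_sorted, length)).astype(np.int32)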
tokens_dfreqs = {tok:dfreq for tok, dfreq in zip(tokens_by_df,dfreqs_sorted)}
tokens_pct_list = [int(round(dfreq/length*100,2)) for token,dfreq in tokens_dfreqs.items()]

CPU times: user 4.37 s, sys: 47.9 ms, total: 4.42 s
Wall time: 4.42 s


In [42]:
# we have only 2 documents (that's why we iterate over the interval [1,101] with a step of 50)
number_tokens_with_DF_above_pct = list()
for pct in range(1,101,50):
    index_max = len(np.array(tokens_pct_list)[np.array(tokens_pct_list)>=pct])
    number_tokens_with_DF_above_pct.append(index_max)

In [43]:
# DF = Document Frequency

# df_docfreqs = pd.DataFrame(number_tokens_with_DF_above_pct, columns=['number of tokens with DF above x%'])
# df_docfreqs.index += 1 
# df_docfreqs.transpose()

# plt.plot(number_tokens_with_DF_above_pct)
# plt.title(f'Document Frequency above of {pct}%')
# plt.show()

df_docfreqs = pd.DataFrame({'pct':list(range(1,101,50)),'number of tokens with DF above pct%':number_tokens_with_DF_above_pct})
df_docfreqs.transpose()

| | 0 | 1 |
|---|---|---|
| pct | 1 | 51 |
| number of tokens with DF above pct% | 4195 | 1080 |


**There are 4195 words which appear in at least one of our 2 documents, and 1080 which appear in both.**

**Let's consider that the 4195 words are all important and relevant to our COVID corpus.**

**Observation**: with a corpus of more documents, we could have used another rule, for example keeping only the words that appear in at least 10% of the documents.
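
A document-frequency rule of this kind can be sketched without scikit-learn. The function `tokens_above_df` and the toy corpus below are assumptions for illustration; the documents are supposed to be already tokenized into words.

```python
from collections import Counter

def tokens_above_df(tokenized_docs, min_pct):
    """Return the tokens found in at least min_pct percent of the documents."""
    n_docs = len(tokenized_docs)
    # set(doc) counts each token at most once per document
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return sorted(tok for tok, count in df.items() if 100 * count / n_docs >= min_pct)

# toy corpus of already-tokenized documents (assumed for illustration)
docs = [
    ["COVID-19", "vaccine", "hospitalization"],
    ["COVID-19", "pandemic", "vaccine"],
    ["COVID-19", "lockdown"],
]
print(tokens_above_df(docs, 10))
# ['COVID-19', 'hospitalization', 'lockdown', 'pandemic', 'vaccine']
print(tokens_above_df(docs, 60))
# ['COVID-19', 'vaccine']
```

Raising the threshold keeps only the words shared by many documents, which is one way to limit the number of new embeddings the model has to learn.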

### 2) Get the vocabulary that is not in the original BERT tokenizer

This step is not necessary, as the `tokenizer.add_tokens()` method will add new tokens only if they do not belong to the existing tokenizer vocabulary. However, it helps us to see what these new tokens are.

In [44]:
# list of new tokens
pct = 1
index_max = len(np.array(tokens_pct_list)[np.array(tokens_pct_list)>=pct])
new_tokens = tokens_by_df[:index_max]
# print(len(new_tokens))

old_vocab = [k for k,v in tokenizer.get_vocab().items()]
new_vocab = [token for token in new_tokens]
idx_old_vocab_list = list()
same_tokens_list = list()
different_tokens_list = list()

for idx_new, w in enumerate(new_vocab):
    try:
        idx_old = old_vocab.index(w)
    except ValueError:  # w is not in the original vocabulary
        idx_old = -1
    if idx_old >= 0:
        idx_old_vocab_list.append(idx_old)
        same_tokens_list.append((w, idx_new))
    else:
        different_tokens_list.append((w, idx_new))

In [45]:
len(same_tokens_list),len(different_tokens_list),len(same_tokens_list)+len(different_tokens_list)

(3987, 208, 4195)

**We found 208 tokens (whole words) that are not in the vocabulary of the original tokenizer. The words COVID-19 and hospitalization belong to this list of new tokens.**

In [46]:
# get list of new tokens
new_tokens = [k for k,v in different_tokens_list]
print(len(new_tokens), new_tokens[:20])

208 ['0.002', '0.01', '0.1', '0.4', '0.5', '1.4', '1.7', '2.1', '3.4', '4.6', '50,000', '6,174', 'B.1.1.7', 'B.1.351', 'COVID-19', 'CoV-2', 'CoV.', 'P.1', 'U.S.', 'U07.1']


In [47]:
"COVID" in new_tokens, "hospitalization" in new_tokens

(False, True)

### 3) Add the new tokens (only whole words, not subwords!) in the vocabulary of the original BERT tokenizer

In [48]:
# import model and tokenizer
from transformers import AutoModelForMaskedLM, AutoTokenizer
model_name = "bert-base-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForMaskedLM.from_pretrained(model_name)



In [49]:
print("[ BEFORE ] tokenizer vocab size:", len(tokenizer)) 
added_tokens = tokenizer.add_tokens(new_tokens)

print("[ AFTER ] tokenizer vocab size:", len(tokenizer)) 
print()
print('added_tokens:',added_tokens)
print()

# resize the embeddings matrix of the model 
model.resize_token_embeddings(len(tokenizer)) 

[ BEFORE ] tokenizer vocab size: 28996
[ AFTER ] tokenizer vocab size: 29204

added_tokens: 208



Embedding(29204, 768)

Let's call tokenizer_exBERT our tokenizer with the new tokens.

In [50]:
tokenizer_exBERT = tokenizer

In [51]:
# tokenization of the text
tokens = tokenizer_exBERT.tokenize(text)
print(tokens)

['COVID-19', 'affects', 'different', 'people', 'in', 'different', 'ways', '.', 'Most', 'infected', 'people', 'will', 'develop', 'mild', 'to', 'moderate', 'illness', 'and', 'recover', 'without', 'hospitalization', '.']


In [52]:
# back to text
tokenizer_exBERT.decode(tokenizer_exBERT.encode(text), skip_special_tokens=True)

'COVID-19 affects different people in different ways. Most infected people will develop mild to moderate illness and recover without hospitalization.'

**The tokenizer with the new tokens (only whole words!) did succeed in tokenizing COVID-19 and hospitalization as single tokens (and the rest of the sentence is tokenized correctly as well!)**

**It means that it is fundamental to add only whole words, and not subwords, as new tokens to an existing subword tokenizer like WordPiece!**

In [53]:
# tokenization of the words COVID and hospitalization
print(tokenizer_exBERT.tokenize('COVID'))
print(tokenizer_exBERT.tokenize('hospitalization'))

['COVID']
['hospitalization']


## Let's check the impact of our enriched tokenizer

Let's use a text about COVID taken from a newspaper site (not from Wikipedia).

In [54]:
# source: https://edition.cnn.com/2021/04/05/health/us-coronavirus-monday/index.html
text = 'Experts say Covid-19 vaccinations in the US are going extremely well -- but not enough people are protected yet and the country may be at the start of another surge. \
The US reported a record over the weekend with more than 4 million Covid-19 vaccine doses administered in 24 hours, according to the Centers for Disease Control and Prevention. \
And the country now averages more than 3 million doses daily, according to CDC data. \
But only about 18.5% of Americans are fully vaccinated, CDC data shows, and Covid-19 cases in the country have recently seen concerning increases. \
"I do think we still have a few more rough weeks ahead," Dr. Celine Gounder, an infectious diseases specialist and epidemiologist, told CNN on Sunday. \
"What we know from the past year of the pandemic is that we tend to trend about three to four weeks behind Europe in terms of our pandemic patterns."'

Now, let's tokenize this text both with the original BERT tokenizer and its enriched version.

In [55]:
from transformers import AutoModelForMaskedLM, AutoTokenizer
model_name = "bert-base-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForMaskedLM.from_pretrained(model_name)



In [56]:
tokens = tokenizer.tokenize(text)
print('number of tokens by the original BERT tokenizer:', len(tokens))

tokens = tokenizer_exBERT.tokenize(text)
print('number of tokens by the enriched tokenizer:', len(tokens))

number of tokens by the original BERT tokenizer: 203
number of tokens by the enriched tokenizer: 194


**As expected, we find that the enriched tokenizer needs fewer tokens (here, about 4% fewer) to tokenize the text on COVID than the original BERT tokenizer.**
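
The relative reduction can be computed directly from the two token counts printed by the cell above:

```python
original_len, enriched_len = 203, 194  # token counts printed by the cell above
reduction = (original_len - enriched_len) / original_len
print(f"{reduction:.1%}")  # 4.4%
```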

## To be continued...

Now that we have augmented our tokenizer vocabulary with words specific to our corpus, we need to fine-tune the natural language model it is associated with (here, the bert-base-cased model). Indeed, adding new words increased the number of rows of the model's embedding matrix by the same number: **for each new word added, a new embedding vector with random values was added as well** by the `model.resize_token_embeddings(len(tokenizer))` method. So we need to train (or fine-tune) our model on our corpus so that it can learn the embeddings of these new words.
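
Conceptually, resizing the embedding matrix keeps the existing rows and appends one randomly initialized row per new token. The pure-Python toy below only illustrates that idea: `resize_embeddings` is a hypothetical helper, not the Transformers implementation, and the Gaussian initialization with a standard deviation of 0.02 is an assumption.

```python
import random

def resize_embeddings(matrix, new_size, dim, seed=0):
    """Return a matrix with new_size rows: old rows kept, new rows randomly initialized."""
    rng = random.Random(seed)
    resized = [row[:] for row in matrix[:new_size]]  # copy the rows already learned
    while len(resized) < new_size:
        # assumed initialization: small Gaussian noise around zero
        resized.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return resized

old = [[0.1, 0.2], [0.3, 0.4]]  # 2 known tokens, embedding dimension 2
new = resize_embeddings(old, new_size=4, dim=2)
print(len(new))          # 4
print(new[:2] == old)    # True
```

The two appended rows carry no information about the new words yet, which is why a fine-tuning step is needed.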

Hugging Face provides a script and a notebook to fine-tune a natural language model on a new corpus (*How to fine-tune a model on language modeling*: [script](https://github.com/huggingface/transformers/tree/master/examples/language-modeling) | [github](https://github.com/huggingface/notebooks/blob/master/examples/language_modeling.ipynb) | [colab](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/language_modeling.ipynb)). **We therefore have ready-to-use code. However, this code may not be adapted to your situation**: if the number of new words (and therefore of new embedding vectors) is high, training with this code may lead to catastrophic forgetting by significantly modifying the embedding vectors of the tokens of the initial vocabulary.

**My advice**: do a Google search with a query like "*fine-tune a pre-trained model for a specific domain*". You will find many interesting articles and documents on this topic. Good luck!

# END